Preface





This book (known as CS:APP) is for computer scientists, computer engineers, and
others who want to be able to write better programs by learning what is going on
“under the hood” of a computer system.

Our aim is to explain the enduring concepts underlying all computer systems,
and to show you the concrete ways that these ideas affect the correctness,
performance, and utility of your application programs. Many systems books are written
from a builder's perspective, describing how to implement the hardware or the
systems software, including the operating system, compiler, and network interface.
This book is written from a programmer's perspective, describing how application
programmers can use their knowledge of a system to write better programs. Of
course, learning what a system is supposed to do provides a good first step in
learning how to build one, so this book also serves as a valuable introduction to those
who go on to implement systems hardware and software. Most systems books also
tend to focus on just one aspect of the system, for example, the hardware
architecture, the operating system, the compiler, or the network. This book spans all
of these aspects, with the unifying theme of a programmer's perspective.

If you study and learn the concepts in this book, you will be on your way to
becoming the rare power programmer who knows how things work and how to
fix them when they break. You will be able to write programs that make better
use of the capabilities provided by the operating system and systems software,
that operate correctly across a wide range of operating conditions and run-time
parameters, that run faster, and that avoid the flaws that make programs
vulnerable to cyberattack. You will be prepared to delve deeper into advanced topics
such as compilers, computer architecture, operating systems, embedded systems,
networking, and cybersecurity.


Assumptions about the Reader's Background



This book focuses on systems that execute x86-64 machine code. x86-64 is the latest 
in an evolutionary path followed by Intel and its competitors that started with the 
8086 microprocessor in 1978. Due to the naming conventions used by Intel for 
its microprocessor line, this class of microprocessors is referred to colloquially as 
“x86.” As semiconductor technology has evolved to allow more transistors to be 
integrated onto a single chip, these processors have progressed greatly in their
computing power and their memory capacity. As part of this progression, they
have gone from operating on 16-bit words, to 32-bit words with the introduction
of IA32 processors, and most recently to 64-bit words with x86-64. 

We consider how these machines execute C programs on Linux. Linux is one
of a number of operating systems having their heritage in the Unix operating
system developed originally by Bell Laboratories. Other members of this class
of operating systems include Solaris, FreeBSD, and MacOS X. In recent years,
these operating systems have maintained a high level of compatibility through the
Posix and Standard Unix Specification standardization efforts. Thus,
the material in this book applies almost directly to these “Unix-like” operating 
systems. 

The text contains numerous programming examples that have been compiled
and run on Linux systems. We assume that you have access to such a machine, and 
are able to log in and do simple things such as listing files and changing directo- 
ries. If your computer runs Microsoft Windows, we recommend that you install 
one of the many different virtual machine environments (such as VirtualBox or 
VMWare) that allow programs written for one operating system (the guest OS) 
to run under another (the host OS). 

We also assume that you have some familiarity with C or C++. If your only 
prior experience is with Java, the transition will require more effort on your part, 
but we will help you. Java and C share similar syntax and control statements. 
However, there are aspects of C (particularly pointers, explicit dynamic memory 
allocation, and formatted I/O) that do not exist in Java. Fortunately, C is a small 
language, and it is clearly and beautifully described in the classic “K&R” text 
by Brian Kernighan and Dennis Ritchie [61]. Regardless of your programming 
background, consider K&R an essential part of your personal systems library. If 
your prior experience is with an interpreted language, such as Python, Ruby, or 
Perl, you will definitely want to devote some time to learning C before you attempt 
to use this book. 

Several of the early chapters in the book explore the interactions between C
programs and their machine-language counterparts. The machine-language
examples were all generated by the GNU GCC compiler running on x86-64 processors.
We do not assume any prior experience with hardware, machine language, or
assembly-language programming.


How to Read the Book 


Learning how computer systems work from a programmer's perspective is great
fun, mainly because you can do it actively. Whenever you learn something new,
you can try it out right away and see the result firsthand. In fact, we believe that
the only way to learn systems is to do systems, either working concrete problems
or writing and running programs on real systems.

This theme pervades the entire book. When a new concept is introduced, it
is followed in the text by one or more practice problems that you should work
immediately to test your understanding. Solutions to the practice problems are
at the end of each chapter. As you read, try to solve each problem on your own
and then check the solution to make sure you are on the right track. Each chapter
is followed by a set of homework problems of varying difficulty. Your instructor
has the solutions to the homework problems in an instructor's manual. For each
homework problem, we show a rating of the amount of effort we feel it will require:


◆ Should require just a few minutes. Little or no programming required.


◆◆ Might require up to 20 minutes. Often involves writing and testing some
code. (Many of these are derived from problems we have given on exams.)


◆◆◆ Requires a significant effort, perhaps 1-2 hours. Generally involves
writing and testing a significant amount of code.


◆◆◆◆ A lab assignment, requiring up to 10 hours of effort.


---------------------------------------------------------- code/intro/hello.c
#include <stdio.h>

int main()
{
    printf("hello, world\n");
    return 0;
}
---------------------------------------------------------- code/intro/hello.c

Figure 1 A typical code example.


Each code example in the text was formatted directly, without any manual
intervention, from a C program compiled with GCC and tested on a Linux system.
Of course, your system may have a different version of GCC, or a different compiler
altogether, so your compiler might generate different machine code; but the
overall behavior should be the same. All of the source code is available from the
CS:APP Web page (“CS:APP” being our shorthand for the book's title) at
csapp.cs.cmu.edu. In the text, the filenames of the source programs are documented
in horizontal bars that surround the formatted code. For example, the program in
Figure 1 can be found in the file hello.c in directory code/intro/. We encourage
you to try running the example programs on your system as you encounter them.

To avoid having a book that is overwhelming, both in bulk and in content, we
have created a number of Web asides containing material that supplements the
main presentation of the book. These asides are referenced within the book with
a notation of the form CHAP:TOP, where CHAP is a short encoding of the chapter
subject, and TOP is a short code for the topic that is covered. For example, Web Aside
DATA:BOOL contains supplementary material on Boolean algebra for the
presentation on data representations in Chapter 2, while Web Aside ARCH:VLOG contains
material describing processor designs using the Verilog hardware description
language, supplementing the presentation of processor design in Chapter 4. All of
these Web asides are available from the CS:APP Web page.


Book Overview 


The CS:APP book consists of 12 chapters designed to capture the core ideas in 
computer systems. Here is an overview. 




Chapter 1: A Tour of Computer Systems. This chapter introduces the major ideas 
and themes in computer systems by tracing the life cycle of a simple “hello,
world” program.


Chapter 2: Representing and Manipulating Information. We cover computer
arithmetic, emphasizing the properties of unsigned and two's-complement
number representations that affect programmers. We consider how numbers
are represented and therefore what range of values can be encoded for
a given word size. We consider the effect of casting between signed and
unsigned numbers. We cover the mathematical properties of arithmetic
operations. Novice programmers are often surprised to learn that the
(two's-complement) sum or product of two positive numbers can be negative. On
the other hand, two's-complement arithmetic satisfies many of the algebraic
properties of integer arithmetic, and hence a compiler can safely transform
multiplication by a constant into a sequence of shifts and adds. We use the
bit-level operations of C to demonstrate the principles and applications of
Boolean algebra. We cover the IEEE floating-point format in terms of how
it represents values and the mathematical properties of floating-point
operations.

Having a solid understanding of computer arithmetic is critical to
writing reliable programs. For example, programmers and compilers cannot
replace the expression (x < y) with (x-y < 0), due to the possibility of overflow.
They cannot even replace it with the expression (-y < -x), due to the
asymmetric range of negative and positive numbers in the two's-complement
representation. Arithmetic overflow is a common source of programming
errors and security vulnerabilities, yet few other books cover the properties
of computer arithmetic from a programmer's perspective.
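
The comparison pitfall described above can be made concrete with a short C sketch. This is an illustration under stated assumptions, not code from the book: the function names are hypothetical, a 32-bit word is modeled with int32_t, and the wrapped results are computed through unsigned arithmetic so that the sketch itself never performs an undefined signed overflow.

```c
#include <stdint.h>

/* Reinterpret a 32-bit pattern as a two's-complement value, written so that
   no out-of-range conversion to a signed type is ever performed. */
static int32_t to_signed(uint32_t u)
{
    if (u <= (uint32_t)INT32_MAX)
        return (int32_t)u;
    return (int32_t)(u - (uint32_t)INT32_MAX - 1u) + INT32_MIN;
}

/* The reference comparison (x < y). */
int less_direct(int32_t x, int32_t y)
{
    return x < y;
}

/* Hypothetical rewrite (x-y < 0), with the subtraction wrapping modulo 2^32. */
int less_by_subtraction(int32_t x, int32_t y)
{
    int32_t diff = to_signed((uint32_t)x - (uint32_t)y);
    return diff < 0;
}

/* Hypothetical rewrite (-y < -x), with each negation wrapping modulo 2^32. */
int less_by_negation(int32_t x, int32_t y)
{
    int32_t neg_y = to_signed(0u - (uint32_t)y);
    int32_t neg_x = to_signed(0u - (uint32_t)x);
    return neg_y < neg_x;
}
```

With x = INT32_MAX and y = -1, less_direct returns 0 but less_by_subtraction returns 1, because the difference wraps around to INT32_MIN. With x = INT32_MIN and y = 0, less_direct returns 1 but less_by_negation returns 0, because negating INT32_MIN yields INT32_MIN again.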


Chapter 3: Machine-Level Representation of Programs. We teach you how to read
the x86-64 machine code generated by a C compiler. We cover the basic
instruction patterns generated for different control constructs, such as
conditionals, loops, and switch statements. We cover the implementation
of procedures, including stack allocation, register usage conventions, and
parameter passing. We cover the way different data structures such as
structures, unions, and arrays are allocated and accessed. We cover the
instructions that implement both integer and floating-point arithmetic. We also
use the machine-level view of programs as a way to understand common
code security vulnerabilities, such as buffer overflow, and steps that the
programmer, the compiler, and the operating system can take to reduce these
threats. Learning the concepts in this chapter helps you become a better
programmer, because you will understand how programs are represented
on a machine. One certain benefit is that you will develop a thorough and
concrete understanding of pointers.


Chapter 4: Processor Architecture. This chapter covers basic combinational and
sequential logic elements, and then shows how these elements can be combined
in a datapath that executes a simplified subset of the x86-64 instruction
set called “Y86-64.” We begin with the design of a single-cycle datapath.
This design is conceptually very simple, but it would not be very fast. We
then introduce pipelining, where the different steps required to process an
instruction are implemented as separate stages. At any given time, each
stage can work on a different instruction. Our five-stage processor pipeline is
much more realistic. The control logic for the processor designs is described
using a simple hardware description language called HCL. Hardware designs
written in HCL can be compiled and linked into simulators provided
with the textbook, and they can be used to generate Verilog descriptions
suitable for synthesis into working hardware.


Chapter 5: Optimizing Program Performance. This chapter introduces a number
of techniques for improving code performance, with the idea being that
programmers learn to write their C code in such a way that a compiler can then
generate efficient machine code. We start with transformations that reduce
the work to be done by a program and hence should be standard practice
when writing any program for any machine. We then progress to
transformations that enhance the degree of instruction-level parallelism in the
generated machine code, thereby improving their performance on modern
“superscalar” processors. To motivate these transformations, we introduce
a simple operational model of how modern out-of-order processors work,
and show how to measure the potential performance of a program in terms
of the critical paths through a graphical representation of a program. You
will be surprised how much you can speed up a program by simple
transformations of the C code.


Chapter 6: The Memory Hierarchy. The memory system is one of the most visible
parts of a computer system to application programmers. To this point, you
have relied on a conceptual model of the memory system as a linear array
with uniform access times. In practice, a memory system is a hierarchy of
storage devices with different capacities, costs, and access times. We cover
the different types of RAM and ROM memories and the geometry and
organization of magnetic-disk and solid state drives. We describe how these
storage devices are arranged in a hierarchy. We show how this hierarchy is
made possible by locality of reference. We make these ideas concrete by
introducing a unique view of a memory system as a “memory mountain”
with ridges of temporal locality and slopes of spatial locality. Finally, we
show you how to improve the performance of application programs by
improving their temporal and spatial locality.


Chapter 7: Linking. This chapter covers both static and dynamic linking, including
the ideas of relocatable and executable object files, symbol resolution,
relocation, static libraries, shared object libraries, position-independent code,
and library interpositioning. Linking is not covered in most systems texts,
but we cover it for two reasons. First, some of the most confusing errors that
programmers can encounter are related to glitches during linking, especially
for large software packages. Second, the object files produced by linkers are
tied to concepts such as loading, virtual memory, and memory mapping.


Chapter 8: Exceptional Control Flow. In this part of the presentation, we step
beyond the single-program model by introducing the general concept of
exceptional control flow (i.e., changes in control flow that are outside the
normal branches and procedure calls). We cover examples of exceptional
control flow that exist at all levels of the system, from low-level hardware
exceptions and interrupts, to context switches between concurrent processes,
to abrupt changes in control flow caused by the receipt of Linux signals, to
the nonlocal jumps in C that break the stack discipline.

This is the part of the book where we introduce the fundamental idea
of a process, an abstraction of an executing program. You will learn how
processes work and how they can be created and manipulated from
application programs. We show how application programmers can make use of
multiple processes via Linux system calls. When you finish this chapter, you
will be able to write a simple Linux shell with job control. It is also your first
introduction to the nondeterministic behavior that arises with concurrent
program execution.


Chapter 9: Virtual Memory. Our presentation of the virtual memory system seeks
to give some understanding of how it works and its characteristics. We want
you to know how it is that the different simultaneous processes can each use
an identical range of addresses, sharing some pages but having individual
copies of others. We also cover issues involved in managing and
manipulating virtual memory. In particular, we cover the operation of storage
allocators such as the standard-library malloc and free operations.
Covering this material serves several purposes. It reinforces the concept that
the virtual memory space is just an array of bytes that the program can
subdivide into different storage units. It helps you understand the effects
of programs containing memory referencing errors such as storage leaks
and invalid pointer references. Finally, many application programmers write
their own storage allocators optimized toward the needs and
characteristics of the application. This chapter, more than any other, demonstrates the
benefit of covering both the hardware and the software aspects of computer
systems in a unified way. Traditional computer architecture and operating
systems texts present only part of the virtual memory story.


Chapter 10: System-Level I/O. We cover the basic concepts of Unix I/O such as
files and descriptors. We describe how files are shared, how I/O redirection
works, and how to access file metadata. We also develop a robust buffered
I/O package that deals correctly with a curious behavior known as short
counts, where the library function reads only part of the input data. We
cover the C standard I/O library and its relationship to Linux I/O, focusing
on limitations of standard I/O that make it unsuitable for network
programming. In general, the topics covered in this chapter are building blocks for
the next two chapters on network and concurrent programming.
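
The short-count behavior mentioned above can be handled with a retry loop around the POSIX read call. The following is a minimal sketch in that spirit, not the book's own package: the name read_fully is hypothetical, and the loop is unbuffered for brevity.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read until n bytes have arrived, EOF is
   reached, or a real error occurs. Returns the number of bytes actually
   stored in buf, or -1 on error (errno is left as set by read). */
ssize_t read_fully(int fd, void *buf, size_t n)
{
    size_t left = n;
    char *p = buf;

    while (left > 0) {
        ssize_t got = read(fd, p, left);
        if (got < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: try again */
            return -1;           /* genuine error */
        }
        if (got == 0)
            break;               /* end of file: stop with a short total */
        left -= (size_t)got;     /* short count: advance and keep reading */
        p += got;
    }
    return (ssize_t)(n - left);
}
```

A caller that asks for 10 bytes from a descriptor holding only 5 gets back 5 rather than an error, so a result smaller than n signals end of file and not a transient short count.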


Chapter 11: Network Programming. Networks are interesting I/O devices to
program, tying together many of the ideas that we study earlier in the text, such
as processes, signals, byte ordering, memory mapping, and dynamic storage
allocation. Network programs also provide a compelling context for
concurrency, which is the topic of the next chapter. This chapter is a thin slice
through network programming that gets you to the point where you can
write a simple Web server. We cover the client-server model that underlies
all network applications. We present a programmer's view of the Internet
and show how to write Internet clients and servers using the sockets
interface. Finally, we introduce HTTP and develop a simple iterative Web server.


Chapter 12: Concurrent Programming. This chapter introduces concurrent
programming using Internet server design as the running motivational example.
We compare and contrast the three basic mechanisms for writing concurrent
programs (processes, I/O multiplexing, and threads) and show how
to use them to build concurrent Internet servers. We cover basic principles
of synchronization using P and V semaphore operations, thread safety and
reentrancy, race conditions, and deadlocks. Writing concurrent code is
essential for most server applications. We also describe the use of thread-level
programming to express parallelism in an application program, enabling
faster execution on multi-core processors. Getting all of the cores working
on a single computational problem requires a careful coordination of the
concurrent threads, both for correctness and to achieve high performance.


New to This Edition


The first edition of this book was published with a copyright of 2003, while the 
second had a copyright of 2011. Considering the rapid evolution of computer 
technology, the book content has held up surprisingly well. Intel x86 machines 
running C programs under Linux (and related operating systems) have proved to
be a combination that continues to encompass many systems today. However, 
changes in hardware technology, compilers, program library interfaces, and the 
experience of many instructors teaching the material have prompted a substantial 
revision. 

The biggest overall change from the second edition is that we have switched 
our presentation from one based on a mix of IA32 and x86-64 to one based 
exclusively on x86-64. This shift in focus affected the contents of many of the 
chapters. Here is a summary of the significant changes. 


Chapter 1: A Tour of Computer Systems. We have moved the discussion of
Amdahl's Law from Chapter 5 into this chapter.


Chapter 2: Representing and Manipulating Information. A consistent bit of feed- 
back from readers and reviewers is that some of the material in this chapter 
can be a bit overwhelming. So we have tried to make the material more ac- 
cessible by clarifying the points at which we delve into a more mathematical 
style of presentation. This enables readers to first skim over mathematical 
details to get a high-level overview and then return for a more thorough 
reading. 


Chapter 3: Machine-Level Representation of Programs. We have converted from 
the earlier presentation based on a mix of IA32 and x86-64 to one based 
entirely on x86-64. We have also updated for the style of code generated by 
more recent versions of GCC. The result is a substantial rewriting, including
changing the order in which some of the concepts are presented. We also 
have included, for the first time, a presentation of the machine-level support 
for programs operating on floating-point data. We have created a Web aside 
describing IA32 machine code for legacy reasons. 


Chapter 4: Processor Architecture. We have revised the earlier processor design, 
based on a 32-bit architecture, to one that supports 64-bit words and oper- 
ations. 


Chapter 5: Optimizing Program Performance. We have updated the material to 
reflect the performance capabilities of recent generations of x86-64
processors. With the introduction of more functional units and more sophisticated
control logic, the model of program performance we developed based on a 
data-flow representation of programs has become a more reliable predictor 
of performance than it was before. 


Chapter 6: The Memory Hierarchy. We have updated the material to reflect more
recent technology.


Chapter 7: Linking. We have rewritten this chapter for x86-64, expanded the
discussion of using the GOT and PLT to create position-independent code,
and added a new section on a powerful linking technique known as library
interpositioning.


Chapter 8: Exceptional Control Flow. We have added a more rigorous treatment
of signal handlers, including async-signal-safe functions, specific guidelines
for writing signal handlers, and using sigsuspend to wait for handlers.


Chapter 9: Virtual Memory. This chapter has changed only slightly. 


Chapter 10: System-Level I/O. We have added a new section on files and the file
hierarchy, but otherwise, this chapter has changed only slightly.


Chapter 11: Network Programming. We have introduced techniques for protocol- 
independent and thread-safe network programming using the modern 
getaddrinfo and getnameinfo functions, which replace the obsolete and 
non-reentrant gethostbyname and gethostbyaddr functions. 
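
As a rough illustration of the protocol-independent style described above, the sketch below parses a numeric address with getaddrinfo and converts it back to text with getnameinfo. It is only a sketch under stated assumptions: the helper name numeric_roundtrip is hypothetical, and the numeric-only flags are chosen here so that no name-service lookup is attempted.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* request POSIX.1-2001 declarations */
#endif

#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: turn numeric host and port strings into a socket
   address, then back into text. All results go into caller-supplied
   buffers, so no shared static storage is involved.
   Returns 0 on success or a nonzero getaddrinfo/getnameinfo error code. */
int numeric_roundtrip(const char *host, const char *port,
                      char *hostbuf, size_t hostlen,
                      char *portbuf, size_t portlen)
{
    struct addrinfo hints;
    struct addrinfo *list = NULL;
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;          /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

    rc = getaddrinfo(host, port, &hints, &list);
    if (rc != 0)
        return rc;

    /* Convert the first result back to numeric host and service strings. */
    rc = getnameinfo(list->ai_addr, list->ai_addrlen,
                     hostbuf, (socklen_t)hostlen,
                     portbuf, (socklen_t)portlen,
                     NI_NUMERICHOST | NI_NUMERICSERV);
    freeaddrinfo(list);
    return rc;
}
```

Because the address family is left as AF_UNSPEC, the same call accepts either an IPv4 or an IPv6 literal without any change to the calling code.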


Chapter 12: Concurrent Programming. We have increased our coverage of using 
thread-level parallelism to make programs run faster on multi-core
machines.


In addition, we have added and revised a number of practice and homework 
problems throughout the text. 


Origins of the Book 


This book stems from an introductory course that we developed at Carnegie Mel- 
lon University in the fall of 1998, called 15-213: Introduction to Computer Systems 
(ICS) [14]. The ICS course has been taught every semester since then. Over 400
students take the course each semester. The students range from sophomores to
graduate students in a wide variety of majors. It is a required core course for all 
undergraduates in the CS and ECE departments at Carnegie Mellon, and it has 
become a prerequisite for most upper-level systems courses in CS and ECE. 

The idea with ICS was to introduce students to computers in a different way. 
Few of our students would have the opportunity to build a computer system. On
the other hand, most students, including all computer scientists and computer
engineers, would be required to use and program computers on a daily basis. So we
decided to teach about systems from the point of view of the programmer, using
the following filter: we would cover a topic only if it affected the performance,
correctness, or utility of user-level C programs.

For example, topics such as hardware adder and bus designs were out.
Topics such as machine language were in; but instead of focusing on how to write
assembly language by hand, we would look at how a C compiler translates C
constructs into machine code, including pointers, loops, procedure calls, and switch
statements. Further, we would take a broader and more holistic view of the system
as both hardware and systems software, covering such topics as linking, loading,
processes, signals, performance optimization, virtual memory, I/O, and network
and concurrent programming.

This approach allowed us to teach the ICS course in a way that is practical, 
concrete, hands-on, and exciting for the students. The response from our students 
and faculty colleagues was immediate and overwhelmingly positive, and we real- 
ized that others outside of CMU might benefit from using our approach. Hence 
this book, which we developed from the ICS lecture notes, and which we have 
now revised to reflect changes in technology and in how computer systems are 
implemented. 

Via the multiple editions and multiple translations of this book, ICS and many 
variants have become part of the computer science and computer engineering 
curricula at hundreds of colleges and universities worldwide. 


For Instructors: Courses Based on the Book 


Instructors can use the CS:APP book to teach a number of different types of 
systems courses. Five categories of these courses are illustrated in Figure 2. The 
particular course depends on curriculum requirements, personal taste, and 
the backgrounds and abilities of the students. From left to right in the figure, 
the courses are characterized by an increasing emphasis on the programmer's 
perspective of a system. Here is a brief description. 


ORG. A computer organization course with traditional topics covered in an un- 
traditional style. Traditional topics such as logic design, processor architec- 
ture, assembly language, and memory systems are covered. However, there 
is more emphasis on the impact for the programmer. For example, data rep- 
resentations are related back to the data types and operations of C programs, 
and the presentation on assembly code is based on machine code generated 
by a C compiler rather than handwritten assembly code. 


ORG+. The ORG course with additional emphasis on the impact of hardware 
on the performance of application programs. Compared to ORG, students 
learn more about code optimization and about improving the memory per- 
formance of their C programs. 


ICS. The baseline ICS course, designed to produce enlightened programmers who 
understand the impact of the hardware, operating system, and compilation 
system on the performance and correctness of their application programs. 
A significant difference from ORG+ is that low-level processor architecture 
is not covered. Instead, programmers work with a higher-level model of a 
modern out-of-order processor. The ICS course fits nicely into a 10-week 
quarter, and can also be stretched to a 15-week semester if covered at a 
more leisurely pace. 


ICS+. The baseline ICS course with additional coverage of systems programming 
topics such as system-level I/O, network programming, and concurrent pro- 
gramming. This is the semester-long Carnegie Mellon course, which covers 
every chapter in CS:APP except low-level processor architecture. 








Chapter  Topic                      ORG    ORG+   ICS    ICS+   SP
1        Tour of systems            •      •      •      •      •
2        Data representation        •      •      •      •      ⊙ (d)
3        Machine language           •      •      •      •      •
4        Processor architecture     •      •
5        Code optimization                 •      •      •
6        Memory hierarchy           ⊙ (a)  •      •      •      ⊙ (a)
7        Linking                                  ⊙ (c)  ⊙ (c)  •
8        Exceptional control flow                 •      •      •
9        Virtual memory             ⊙ (b)  •      •      •      •
10       System-level I/O                                •      •
11       Network programming                             •      •
12       Concurrent programming                          •      •

Figure 2 Five systems courses based on the CS:APP book. ICS+ is the 15-213 course
from Carnegie Mellon. Notes: The ⊙ symbol denotes partial coverage of a chapter, as
follows: (a) hardware only; (b) no dynamic storage allocation; (c) no dynamic linking;
(d) no floating point.


SP. A systems programming course. This course is similar to ICS+, but it drops
floating point and performance optimization, and it places more emphasis
on systems programming, including process control, dynamic linking,
system-level I/O, network programming, and concurrent programming.
Instructors might want to supplement from other sources for advanced topics
such as daemons, terminal control, and Unix IPC.




The main message of Figure 2 is that the CS:APP book gives a lot of options
to students and instructors. If you want your students to be exposed to
lower-level processor architecture, then that option is available via the ORG and ORG+
courses. On the other hand, if you want to switch from your current computer
organization course to an ICS or ICS+ course, but are wary of making such a
drastic change all at once, then you can move toward ICS incrementally. You
can start with ORG, which teaches the traditional topics in a nontraditional way.
Once you are comfortable with that material, then you can move to ORG+,
and eventually to ICS. If students have no experience in C (e.g., they have only
programmed in Java), you could spend several weeks on C and then cover the
material of ORG or ICS.

Finally, we note that the ORG+ and SP courses would make a nice two-term
sequence (either quarters or semesters). Or you might consider offering ICS+ as
one term of ICS and one term of SP.


For Instructors: Classroom-Tested Laboratory Exercises


The ICS+ course at Carnegie Mellon receives very high evaluations from students. 
Median scores of 5.0/5.0 and means of 4.6/5.0 are typical for the student course 
evaluations. Students cite the fun, exciting, and relevant laboratory exercises as 
the primary reason. The labs are available from the CS:APP Web page. Here are 
examples of the labs that are provided with the book. 


Data Lab. This lab requires students to implement simple logical and arithmetic 
functions, but using a highly restricted subset of C. For example, they must 
compute the absolute value of a number using only bit-level operations. This 
lab helps students understand the bit-level representations of C data types 
and the bit-level behavior of the operations on data. 
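
To give a flavor of the kind of puzzle described above, here is one possible way to compute an absolute value without any comparison or branch. It is a sketch, not the lab's reference solution: the function name abs_bits is hypothetical, and it assumes that shifts, complement, exclusive-or, and addition are among the permitted operators, which the text above does not specify. The work is done on the unsigned bit pattern so that the most negative input is handled without overflow.

```c
#include <stdint.h>

/* Hypothetical solution: absolute value of a 32-bit two's-complement number
   using only shift, complement, exclusive-or, and addition. The result is
   returned as unsigned so that the magnitude of INT32_MIN is representable. */
uint32_t abs_bits(int32_t x)
{
    uint32_t ux = (uint32_t)x;     /* same bit pattern, unsigned view */
    uint32_t sign = ux >> 31;      /* 1 if x is negative, else 0 */
    uint32_t mask = ~sign + 1u;    /* all ones if negative, else all zeros */

    /* For negative x: flip every bit and add one (two's-complement negate).
       For nonnegative x: both steps leave the value unchanged. */
    return (ux ^ mask) + sign;
}
```

For example, abs_bits(-5) yields 5, abs_bits(7) yields 7, and abs_bits(INT32_MIN) yields 2147483648, a magnitude that a signed 32-bit result could not hold.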


Binary Bomb Lab. A binary bomb is a program provided to students as an object- 
code file. When run, it prompts the user to type in six different strings. If 
any of these are incorrect, the bomb “explodes,” printing an error message 
and logging the event on a grading server. Students must “defuse” their 
own unique bombs by disassembling and reverse engineering the programs 
to determine what the six strings should be. The lab teaches students to
understand assembly language and also forces them to learn how to use a 
debugger. 


Buffer Overflow Lab. Students are required to modify the run-time behavior of 
a binary executable by exploiting a buffer overflow vulnerability. This lab
teaches the students about the stack discipline and about the danger of 
writing code that is vulnerable to buffer overflow attacks. 


Architecture Lab. Several of the homework problems of Chapter 4 can be com- 
bined into a lab assignment, where students modify the HCL description of 
a processor to add new instructions, change the branch prediction policy, or 
add or remove bypassing paths and register ports. The resulting processors 
can be simulated and run through automated tests that will detect most of 
the possible bugs. This lab lets students experience the exciting parts of pro- 
cessor design without requiring a complete background in logic design and 
hardware description languages. 


Performance Lab. Students must optimize the performance of an application ker- 
nel function such as convolution or matrix transposition. This lab provides 
a very clear demonstration of the properties of cache memories and gives 
students experience with low-level program optimization. 


Cache Lab. In this alternative to the performance lab, students write a general-
purpose cache simulator, and then optimize a small matrix transpose kernel
to minimize the number of misses on a simulated cache. We use the Valgrind 
tool to generate real address traces for the matrix transpose kernel. 


Shell Lab. Students implement their own Unix shell program with job control,
including the Ctrl+C and Ctrl+Z keystrokes and the fg, bg, and jobs commands.
This is the student's first introduction to concurrency, and it gives
them a clear idea of Unix process control, signals, and signal handling.


Malloc Lab. Students implement their own versions of malloc, free, and
(optionally) realloc. This lab gives students a clear understanding of data
layout and organization, and requires them to evaluate different trade-offs
between space and time efficiency.


Proxy Lab. Students implement a concurrent Web proxy that sits between their
browsers and the rest of the World Wide Web. This lab exposes the students
to such topics as Web clients and servers, and ties together many of the
concepts from the course, such as byte ordering, file I/O, process control, signals,
signal handling, memory mapping, sockets, and concurrency. Students like
being able to see their programs in action with real Web browsers and Web
servers.


The CS:APP instructor's manual has a detailed discussion of the labs, as well 
as directions for downloading the support software. 
Chapter 1


A Tour of Computer Systems 


A computer system consists of hardware and systems software that work together
to run application programs. Specific implementations of systems
change over time, but the underlying concepts do not. All computer systems have
similar hardware and software components that perform similar functions. This
book is written for programmers who want to get better at their craft by
understanding how these components work and how they affect the correctness and
performance of their programs.

You are poised for an exciting journey. If you dedicate yourself to learning the
concepts in this book, then you will be on your way to becoming a rare “power
programmer,” enlightened by an understanding of the underlying computer system
and its impact on your application programs.

You are going to learn practical skills such as how to avoid strange numerical
errors caused by the way that computers represent numbers. You will learn how
to optimize your C code by using clever tricks that exploit the designs of modern
processors and memory systems. You will learn how the compiler implements
procedure calls and how to use this knowledge to avoid the security holes from
buffer overflow vulnerabilities that plague network and Internet software. You will
learn how to recognize and avoid the nasty errors during linking that confound
the average programmer. You will learn how to write your own Unix shell, your
own dynamic storage allocation package, and even your own Web server. You will
learn the promises and pitfalls of concurrency, a topic of increasing importance as
multiple processor cores are integrated onto single chips.

In their classic text on the C programming language [61], Kernighan and 
Ritchie introduce readers to C using the hello program shown in Figure 1.1.
Although hello is a very simple program, every major part of the system must 
work in concert in order for it to run to completion. In a sense, the goal of this 
book is to help you understand what happens and why when you run hello on 
your system. 

We begin our study of systems by tracing the lifetime of the hello program,
from the time it is created by a programmer, until it runs on a system, prints its
simple message, and terminates. As we follow the lifetime of the program, we will
briefly introduce the key concepts, terminology, and components that come into
play. Later chapters will expand on these ideas.


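------------------------------------------------------------ code/intro/hello.c

1   #include <stdio.h>
2
3   int main()
4   {
5       printf("hello, world\n");
6       return 0;
7   }

------------------------------------------------------------ code/intro/hello.c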


Figure 1.1 The hello program. (Source: [60])


#    i    n    c    l    u    d    e    SP   <    s    t    d    i    o    .
35   105  110  99   108  117  100  101  32   60   115  116  100  105  111  46

h    >    \n   \n   i    n    t    SP   m    a    i    n    (    )    \n   {
104  62   10   10   105  110  116  32   109  97   105  110  40   41   10   123

\n   SP   SP   SP   SP   p    r    i    n    t    f    (    "    h    e    l
10   32   32   32   32   112  114  105  110  116  102  40   34   104  101  108

l    o    ,    SP   w    o    r    l    d    \    n    "    )    ;    \n   SP
108  111  44   32   119  111  114  108  100  92   110  34   41   59   10   32

SP   SP   SP   r    e    t    u    r    n    SP   0    ;    \n   }    \n
32   32   32   114  101  116  117  114  110  32   48   59   10   125  10


Figure 1.2 The ASCII text representation of hello.c. 


1.1 Information Is Bits + Context


Our hello program begins life as a source program (or source file) that the
programmer creates with an editor and saves in a text file called hello.c. The
source program is a sequence of bits, each with a value of 0 or 1, organized in 8-bit
chunks called bytes. Each byte represents some text character in the program.

Most computer systems represent text characters using the ASCII standard
that represents each character with a unique byte-size integer value.¹ For example,
Figure 1.2 shows the ASCII representation of the hello.c program.

The hello.c program is stored in a file as a sequence of bytes. Each byte has
an integer value that corresponds to some character. For example, the first byte
has the integer value 35, which corresponds to the character ‘#’. The second byte
has the integer value 105, which corresponds to the character ‘i’, and so on. Notice
that each text line is terminated by the invisible newline character ‘\n’, which is
represented by the integer value 10. Files such as hello.c that consist exclusively
of ASCII characters are known as text files. All other files are known as binary
files.

The representation of hello.c illustrates a fundamental idea: All information
in a system (including disk files, programs stored in memory, user data stored in
memory, and data transferred across a network) is represented as a bunch of bits.
The only thing that distinguishes different data objects is the context in which
we view them. For example, in different contexts, the same sequence of bytes
might represent an integer, floating-point number, character string, or machine
instruction.

As programmers, we need to understand machine representations of numbers
because they are not the same as integers and real numbers. They are finite 




1. Other encoding methods are used to represent text in non-English languages. See the aside on page 
50 for a discussion on this. 
Aside: Origins of the C programming language

C was developed from 1969 to 1973 by Dennis Ritchie of Bell Laboratories. The American National
Standards Institute (ANSI) ratified the ANSI C standard in 1989, and this standardization later became
the responsibility of the International Standards Organization (ISO). The standards define the C
language and a set of library functions known as the C standard library. Kernighan and Ritchie describe
ANSI C in their classic book, which is known affectionately as “K&R” [61]. In Ritchie’s words [92], C
is “quirky, flawed, and an enormous success.” So why the success?

* C was closely tied with the Unix operating system. C was developed from the beginning as the
  system programming language for Unix. Most of the Unix kernel (the core part of the operating
  system), and all of its supporting tools and libraries, were written in C. As Unix became popular in
  universities in the late 1970s and early 1980s, many people were exposed to C and found that they
  liked it. Since Unix was written almost entirely in C, it could be easily ported to new machines,
  which created an even wider audience for both C and Unix.

* C is a small, simple language. The design was controlled by a single person, rather than a committee,
  and the result was a clean, consistent design with little baggage. The K&R book describes the
  complete language and standard library, with numerous examples and exercises, in only 261 pages.
  The simplicity of C made it relatively easy to learn and to port to different computers.

* C was designed for a practical purpose. C was designed to implement the Unix operating system.
  Later, other people found that they could write the programs they wanted, without the language
  getting in the way.

C is the language of choice for system-level programming, and there is a huge installed base of
application-level programs as well. However, it is not perfect for all programmers and all situations.
C pointers are a common source of confusion and programming errors. C also lacks explicit support
for useful abstractions such as classes, objects, and exceptions. Newer languages such as C++ and Java
address these issues for application-level programs.

approximations that can behave in unexpected ways. This fundamental idea is
explored in detail in Chapter 2. 


1.2 Programs Are Translated by Other Programs 
into Different Forms 


The hello program begins life as a high-level C program because it can be read 
and understood by human beings in that form. However, in order to run hello.c 
on the system, the individual C statements must be translated by other programs 
into a sequence of low-level machine-language instructions. These instructions are 
then packaged in a form called an executable object program and stored as a binary 
disk file. Object programs are also referred to as executable object files. 

Figure 1.3 The compilation system.

On a Unix system, the translation from source file to object file is performed
by a compiler driver:

linux> gcc -o hello hello.c


Here, the gcc compiler driver reads the source file hello.c and translates it into
an executable object file hello. The translation is performed in the sequence 
of four phases shown in Figure 1.3. The programs that perform the four phases 
(preprocessor, compiler, assembler, and linker) are known collectively as the 
compilation system. 


* Preprocessing phase. The preprocessor (cpp) modifies the original C program 
according to directives that begin with the ‘#’ character. For example, the
#include <stdio.h> command in line 1 of hello.c tells the preprocessor 
to read the contents of the system header file stdio.h and insert it directly 
into the program text. The result is another C program, typically with the .i 
suffix. 

* Compilation phase. The compiler (cc1) translates the text file hello.i into 
the text file hello.s, which contains an assembly-language program. This 
program includes the following definition of function main: 


1   main:
2     subq    $8, %rsp
3     movl    $.LC0, %edi
4     call    puts
5     movl    $0, %eax
6     addq    $8, %rsp
7     ret


Each of lines 2-7 in this definition describes one low-level machine-language
instruction in a textual form. Assembly language is useful because
it provides a common output language for different compilers for different
high-level languages. For example, C compilers and Fortran compilers both
generate output files in the same assembly language.


* Assembly phase. Next, the assembler (as) translates hello.s into machine-
language instructions, packages them in a form known as a relocatable object
program, and stores the result in the object file hello.o. This file is a binary 
file containing 17 bytes to encode the instructions for function main. If we 
were to view hello.o with a text editor, it would appear to be gibberish. 
Aside: The GNU project

Gcc is one of many useful tools developed by the GNU (short for GNU's Not Unix) project. The
GNU project is a tax-exempt charity started by Richard Stallman in 1984, with the ambitious goal of
developing a complete Unix-like system whose source code is unencumbered by restrictions on how
it can be modified or distributed. The GNU project has developed an environment with all the major
components of a Unix operating system, except for the kernel, which was developed separately by
the Linux project. The GNU environment includes the Emacs editor, gcc compiler, gdb debugger,
assembler, linker, utilities for manipulating binaries, and other components. The gcc compiler has
grown to support many different languages, with the ability to generate code for many different
machines. Supported languages include C, C++, Fortran, Java, Pascal, Objective-C, and Ada.

The GNU project is a remarkable achievement, and yet it is often overlooked. The modern open-
source movement (commonly associated with Linux) owes its intellectual origins to the GNU project's
notion of free software (“free” as in “free speech,” not “free beer”). Further, Linux owes much of its

* Linking phase. Notice that our hello program calls the printf function, which
is part of the standard C library provided by every C compiler. The printf
function resides in a separate precompiled object file called printf.o, which
must somehow be merged with our hello.o program. The linker (ld) handles
this merging. The result is the hello file, which is an executable object file (or
simply executable) that is ready to be loaded into memory and executed by
the system.


1.3 It Pays to Understand How Compilation Systems Work 


For simple programs such as hello.c, we can rely on the compilation system to
produce correct and efficient machine code. However, there are some important 
reasons why programmers need to understand how compilation systems work: 


* Optimizing program performance. Modern compilers are sophisticated tools 
that usually produce good code. As programmers, we do not need to know 
the inner workings of the compiler in order to write efficient code. However, 
in order to make good coding decisions in our C programs, we do need a 
basic understanding of machine-level code and how the compiler translates 
different C statements into machine code. For example, is a switch statement 
always more efficient than a sequence of if-else statements? How much 
overhead is incurred by a function call? Is a while loop more efficient than 
a for loop? Are pointer references more efficient than array indexes? Why 
does our loop run so much faster if we sum into a local variable instead of an 
argument that is passed by reference? How can a function run faster when we
simply rearrange the parentheses in an arithmetic expression? 

In Chapter 3, we introduce x86-64, the machine language of recent gen- 
erations of Linux, Macintosh, and Windows computers. We describe how 
compilers translate different C constructs into this language. In Chapter 5, 
you will learn how to tune the performance of your C programs by making
simple transformations to the C code that help the compiler do its job better.
In Chapter 6, you will learn about the hierarchical nature of the memory sys- 
tem, how C compilers store data arrays in memory, and how your C programs 
can exploit this knowledge to run more efficiently. 


* Understanding link-time errors. In our experience, some of the most perplexing
programming errors are related to the operation of the linker, especially
when you are trying to build large software systems. For example, what does
it mean when the linker reports that it cannot resolve a reference? What is the
difference between a static variable and a global variable? What happens if
you define two global variables in different C files with the same name? What
is the difference between a static library and a dynamic library? Why does it
matter what order we list libraries on the command line? And scariest of all,
why do some linker-related errors not appear until run time? You will learn
the answers to these kinds of questions in Chapter 7.


* Avoiding security holes. For many years, buffer overflow vulnerabilities have 
accounted for many of the security holes in network and Internet servers. 

These vulnerabilities exist because too few programmers understand the need
to carefully restrict the quantity and forms of data they accept from untrusted
sources. A first step in learning secure programming is to understand the
consequences of the way data and control information are stored on the program
stack. We cover the stack discipline and buffer overflow vulnerabilities in
Chapter 3 as part of our study of assembly language. We will also learn about
methods that can be used by the programmer, compiler, and operating system 
to reduce the threat of attack. 



1.4 Processors Read and Interpret Instructions
Stored in Memory


At this point, our hello.c source program has been translated by the compilation
system into an executable object file called hello that is stored on disk. To run
the executable file on a Unix system, we type its name to an application program
known as a shell:


linux> ./hello
hello, world
linux>
The shell is a command-line interpreter that prints a prompt, waits for you
to type a command line, and then performs the command. If the first word of the
command line does not correspond to a built-in shell command, then the shell
assumes that it is the name of an executable file that it should load and run. So
in this case, the shell loads and runs the hello program and then waits for it to
terminate. The hello program prints its message to the screen and then terminates.
The shell then prints a prompt and waits for the next input command line.

Figure 1.4 Hardware organization of a typical system. CPU: central processing unit,
ALU: arithmetic/logic unit, PC: program counter, USB: Universal Serial Bus.


1.4.1 Hardware Organization of a System


To understand what happens to our hello program when we run it, we need
to understand the hardware organization of a typical system, which is shown
in Figure 1.4. This particular picture is modeled after the family of recent Intel
systems, but all systems have a similar look and feel. Don't worry about the
complexity of this figure just now. We will get to its various details in stages
throughout the course of the book.


Buses 


Running throughout the system is a collection of electrical conduits called buses
that carry bytes of information back and forth between the components. Buses
are typically designed to transfer fixed-size chunks of bytes known as words. The
number of bytes in a word (the word size) is a fundamental system parameter that
varies across systems. Most machines today have word sizes of either 4 bytes (32
bits) or 8 bytes (64 bits). In this book, we do not assume any fixed definition of
word size. Instead, we will specify what we mean by a “word” in any context that
requires this to be defined.

I/O Devices 


Input/output (I/O) devices are the system’s connection to the external world. Our 
example system has four I/O devices: a keyboard and mouse for user input, a 
display for user output, and a disk drive (or simply disk) for long-term storage of 
data and programs. Initially, the executable hello program resides on the disk. 

Each I/O device is connected to the I/O bus by either a controller or an adapter. 
The distinction between the two is mainly one of packaging. Controllers are chip 
sets in the device itself or on the system's main printed circuit board (often called 
the motherboard). An adapter is a card that plugs into a slot on the motherboard. 
Regardless, the purpose of each is to transfer information back and forth between 
the I/O bus and an I/O device.

Chapter 6 has more to say about how I/O devices such as disks work. In 
Chapter 10, you will learn how to use the Unix I/O interface to access devices from 
your application programs. We focus on the especially interesting class of devices 
known as networks, but the techniques generalize to other kinds of devices as well. 


Main Memory 


The main memory is a temporary storage device that holds both a program and 
the data it manipulates while the processor is executing the program. Physically, 
main memory consists of a collection of dynamic random access memory (DRAM) 
chips. Logically, memory is organized as a linear array of bytes, each with its own 
unique address (array index) starting at zero. In general, each of the machine 
instructions that constitute a program can consist of a variable number of bytes. 
The sizes of data items that correspond to C program variables vary according 
to type. For example, on an x86-64 machine running Linux, data of type short 
require 2 bytes, types int and float 4 bytes, and types long and double 8 bytes. 

Chapter 6 has more to say about how memory technologies such as DRAM 
chips work, and how they are combined to form main memory. 


Processor 


The central processing unit (CPU), or simply processor, is the engine that inter- 
prets (or executes) instructions stored in main memory. At its core is a word-size 
storage device (or register) called the program counter (PC). At any point in time, 
the PC points at (contains the address of) some machine-language instruction in 
main memory.²

From the time that power is applied to the system until the time that the 
power is shut off, a processor repeatedly executes the instruction pointed at by the 
program counter and updates the program counter to point to the next instruction. 
A processor appears to operate according to a very simple instruction execution 
model, defined by its instruction set architecture. In this model, instructions execute


2. PC is also a commonly used acronym for “personal computer.” However, the distinction between
the two should be clear from the context. 
in strict sequence, and executing a single instruction involves performing a series
of steps. The processor reads the instruction from memory pointed at by the
program counter (PC), interprets the bits in the instruction, performs some simple
operation dictated by the instruction, and then updates the PC to point to the next
instruction, which may or may not be contiguous in memory to the instruction that
was just executed.

There are only a few of these simple operations, and they revolve around
main memory, the register file, and the arithmetic/logic unit (ALU). The register
file is a small storage device that consists of a collection of word-size registers, each
with its own unique name. The ALU computes new data and address values. Here
are some examples of the simple operations that the CPU might carry out at the
request of an instruction:


* Load: Copy a byte or a word from main memory into a register, overwriting
the previous contents of the register.

* Store: Copy a byte or a word from a register to a location in main memory,
overwriting the previous contents of that location.

* Operate: Copy the contents of two registers to the ALU, perform an arithmetic
operation on the two words, and store the result in a register, overwriting the
previous contents of that register.

* Jump: Extract a word from the instruction itself and copy that word into the
program counter (PC), overwriting the previous value of the PC.


We say that a processor appears to be a simple implementation of its
instruction set architecture, but in fact modern processors use far more complex
mechanisms to speed up program execution. Thus, we can distinguish the
processor's instruction set architecture, describing the effect of each machine-code
instruction, from its microarchitecture, describing how the processor is actually
implemented. When we study machine code in Chapter 3, we will consider the
abstraction provided by the machine's instruction set architecture. Chapter 4 has
more to say about how processors are actually implemented. Chapter 5 describes
a model of how modern processors work that enables predicting and optimizing
the performance of machine-language programs.



1.4.2 Running the hello Program


Given this simple view of a system's hardware organization and operation, we can
begin to understand what happens when we run our example program. We must
omit a lot of details here that will be filled in later, but for now we will be content
with the big picture.

Initially, the shell program is executing its instructions, waiting for us to type a
command. As we type the characters ./hello at the keyboard, the shell program
reads each one into a register and then stores it in memory, as shown in Figure 1.5.

When we hit the enter key on the keyboard, the shell knows that we have
finished typing the command. The shell then loads the executable hello file by
executing a sequence of instructions that copies the code and data in the hello
object file from disk to main memory. The data includes the string of characters
hello, world\n that will eventually be printed out.

Using a technique known as direct memory access (DMA, discussed in Chap- 
ter 6), the data travel directly from disk to main memory, without passing through 
the processor. This step is shown in Figure 1.6. 

Once the code and data in the hello object file are loaded into memory,
the processor begins executing the machine-language instructions in the hello
program's main routine. These instructions copy the bytes in the hello, world\n
string from memory to the register file, and from there to the display device, where
they are displayed on the screen. This step is shown in Figure 1.7.


1.5 Caches Matter 


An important lesson from this simple example is that a system spends a lot of
time moving information from one place to another. The machine instructions in
the hello program are originally stored on disk. When the program is loaded,
they are copied to main memory. As the processor runs the program, instructions
are copied from main memory into the processor. Similarly, the data string
hello, world\n, originally on disk, is copied to main memory and then copied
from main memory to the display device. From a programmer's perspective, much
of this copying is overhead that slows down the “real work” of the program. Thus,
a major goal for system designers is to make these copy operations run as fast as
possible.

Because of physical laws, larger storage devices are slower than smaller storage
devices. And faster devices are more expensive to build than their slower
counterparts. For example, the disk drive on a typical system might be 1,000 times
larger than the main memory, but it might take the processor 10,000,000 times
longer to read a word from disk than from memory.

Figure 1.8 Cache memories.

Similarly, a typical register file stores only a few hundred bytes of information, 
as opposed to billions of bytes in the main memory. However, the processor can 
read data from the register file almost 100 times faster than from memory. Even 
more troublesome, as semiconductor technology progresses over the years, this 
processor-memory gap continues to increase. It is easier and cheaper to make
processors run faster than it is to make main memory run faster. 

To deal with the processor-memory gap, system designers include smaller, 
faster storage devices called cache memories (or simply caches) that serve as 
temporary staging areas for information that the processor is likely to need in 
the near future. Figure 1.8 shows the cache memories in a typical system. An L1
cache on the processor chip holds tens of thousands of bytes and can be accessed 
nearly as fast as the register file. A larger L2 cache with hundreds of thousands 
to millions of bytes is connected to the processor by a special bus. It might take 5 
times longer for the processor to access the L2 cache than the L1 cache, but this is 
still 5 to 10 times faster than accessing the main memory. The L1 and L2 caches are 
implemented with a hardware technology known as static random access memory 
(SRAM). Newer and more powerful systems even have three levels of cache: L1, 
L2, and L3. The idea behind caching is that a system can get the effect of both 
a very large memory and a very fast one by exploiting locality, the tendency for 
programs to access data and code in localized regions. By setting up caches to hold 
data that are likely to be accessed often, we can perform most memory operations 
using the fast caches. 

One of the most important lessons in this book is that application program- 
mers who are aware of cache memories can exploit them to improve the perfor- 
mance of their programs by an order of magnitude. You will learn more about 
these important devices and how to exploit them in Chapter 6. 
Figure 1.9 An example of a memory hierarchy.


1.6 Storage Devices Form a Hierarchy 


This notion of inserting a smaller, faster storage device (e.g., cache memory)
between the processor and a larger, slower device (e.g., main memory) turns out
to be a general idea. In fact, the storage devices in every computer system are
organized as a memory hierarchy similar to Figure 1.9. As we move from the top
of the hierarchy to the bottom, the devices become slower, larger, and less costly
per byte. The register file occupies the top level in the hierarchy, which is known
as level 0 or L0. We show three levels of caching L1 to L3, occupying memory
hierarchy levels 1 to 3. Main memory occupies level 4, and so on.

The main idea of a memory hierarchy is that storage at one level serves as a
cache for storage at the next lower level. Thus, the register file is a cache for the
L1 cache. Caches L1 and L2 are caches for L2 and L3, respectively. The L3 cache
is a cache for the main memory, which is a cache for the disk. On some networked
systems with distributed file systems, the local disk serves as a cache for data stored
on the disks of other systems.

Just as programmers can exploit knowledge of the different caches to improve
performance, programmers can exploit their understanding of the entire memory
hierarchy. Chapter 6 will have much more to say about this.
H "oY ^ 


1.7 The Operating System Manages the Hardware


Back to our hello example. When the shell loaded and ran the hello program,
and when the hello program printed its message, neither program accessed the
keyboard, display, disk, or main memory directly. Rather, they relied on the
services provided by the operating system. We can think of the operating system as
a layer of software interposed between the application program and the hardware,
as shown in Figure 1.10. All attempts by an application program to manipulate the
hardware must go through the operating system.

The operating system has two primary purposes: (1) to protect the hardware 
from misuse by runaway applications and (2) to provide applications with simple 
and uniform mechanisms for manipulating complicated and often wildly different 
low-level hardware devices. The operating system achieves both goals via the 
fundamental abstractions shown in Figure 1.11: processes, virtual memory, and 
files. As this figure suggests, files are abstractions for I/O devices, virtual memory 
is an abstraction for both the main memory and disk I/O devices, and processes 
are abstractions for the processor, main memory, and I/O devices. We will discuss 
each in turn.


1.7.1 Processes 


When a program such as hello runs on a modern system, the operating system
provides the illusion that the program is the only one running on the system. The
program appears to have exclusive use of the processor, main memory, and
I/O devices. The processor appears to execute the instructions in the program, one
after the other, without interruption. And the code and data of the program appear
to be the only objects in the system's memory. These illusions are provided by the
notion of a process, one of the most important and successful ideas in computer
science.

A process is the operating system's abstraction for a running program. Multiple
processes can run concurrently on the same system, and each process appears
to have exclusive use of the hardware. By concurrently, we mean that the
instructions of one process are interleaved with the instructions of another process.
In most systems, there are more processes to run than there are CPUs to run them.

The 1960s was an era of huge, complex operating systems, such as IBM's OS/360 and
Honeywell's Multics systems. While OS/360 was one of the most successful software
projects in history, Multics dragged on for years and never achieved wide-scale use.
Bell Laboratories was an original partner in the Multics project but dropped out in
1969 because of concern over the complexity of the project and the lack of progress.
In reaction to their unpleasant Multics experience, a group of Bell Labs researchers
(Ken Thompson, Dennis Ritchie, Doug McIlroy, and Joe Ossanna) began work in 1969
on a simpler operating system for a Digital Equipment Corporation PDP-7 computer,
written entirely in machine language. Many of the ideas in the new system, such as
the hierarchical file system and the notion of a shell as a user-level process, were
borrowed from Multics but implemented in a smaller, simpler package. In 1970, Brian
Kernighan dubbed the new system "Unix" as a pun on the complexity of "Multics."
The kernel was rewritten in C in 1973, and Unix was announced to the outside world
in 1974 [93].

Because Bell Labs made the source code available to schools with generous terms,
Unix developed a large following at universities. The most influential work was done
at the University of California at Berkeley in the late 1970s and early 1980s, with
Berkeley researchers adding virtual memory and the Internet protocols in a series of
releases called Unix 4.xBSD (Berkeley Software Distribution). Concurrently, Bell Labs
was releasing their own versions, which became known as System V Unix. Versions
from other vendors, such as the Sun Microsystems Solaris system, were derived from
these original BSD and System V versions.

Trouble arose in the mid 1980s as Unix vendors tried to differentiate themselves by
adding new and often incompatible features. To combat this trend, IEEE (Institute for
Electrical and Electronics Engineers) sponsored an effort to standardize Unix, later
dubbed "Posix" by Richard Stallman. The result was a family of standards, known as
the Posix standards, that cover such issues as the C language interface for Unix
system calls, shell programs and utilities, threads, and network programming. More
recently, a separate standardization effort, known as the "Standard Unix
Specification," has joined forces with Posix to create a single, unified standard for
Unix systems. As a result of these standardization efforts, the differences between
Unix versions have largely disappeared.

Traditional systems could only execute one program at a time, while newer
multi-core processors can execute several programs simultaneously. In either case, a
single CPU can appear to execute multiple processes concurrently by having the
processor switch among them. The operating system performs this interleaving
with a mechanism known as context switching. To simplify the rest of this
discussion, we consider only a uniprocessor system containing a single CPU. We will
return to the discussion of multiprocessor systems in Section 1.9.2.

The operating system keeps track of all the state information that the process
needs in order to run. This state, which is known as the context, includes
information such as the current values of the PC, the register file, and the contents
of main memory. At any point in time, a uniprocessor system can only execute the
code for a single process. When the operating system decides to transfer control from
the current process to some new process, it performs a context switch by saving
the context of the current process, restoring the context of the new process, and
then passing control to the new process. The new process picks up exactly where
it left off. Figure 1.12 shows the basic idea for our example hello scenario.

There are two concurrent processes in our example scenario: the shell process
and the hello process. Initially, the shell process is running alone, waiting for input
on the command line. When we ask it to run the hello program, the shell carries
out our request by invoking a special function known as a system call that passes
control to the operating system. The operating system saves the shell's context,
creates a new hello process and its context, and then passes control to the new
hello process. After hello terminates, the operating system restores the context
of the shell process and passes control back to it, where it waits for the next
command-line input.

As Figure 1.12 indicates, the transition from one process to another is man- 
aged by the operating system kernel. The kernel is the portion of the operating 
system code that is always resident in memory. When an application program 
requires some action by the operating system, such as to read or write a file, it 
executes a special system call instruction, transferring control to the kernel. The 
kernel then performs the requested operation and returns back to the application
program. Note that the kernel is not a separate process. Instead, it is a collection
of code and data structures that the system uses to manage all the processes. 

Implementing the process abstraction requires close cooperation between
both the low-level hardware and the operating system software. We will explore 
how this works, and how applications can create and control their own processes, 
in Chapter 8. 


1.7.2 Threads 


Although we normally think of a process as having a single control flow, in modern
systems a process can actually consist of multiple execution units, called threads,
each running in the context of the process and sharing the same code and global
data. Threads are an increasingly important programming model because of the
requirement for concurrency in network servers, because it is easier to share data
between multiple threads than between multiple processes, and because threads
are typically more efficient than processes. Multi-threading is also one way to make
programs run faster when multiple processors are available, as we will discuss in
Section 1.9.2. You will learn the basic concepts of concurrency, including how to
write threaded programs, in Chapter 12.


1.7.3 Virtual Memory 


Virtual memory is an abstraction that provides each process with the illusion that it
has exclusive use of the main memory. Each process has the same uniform view of
memory, which is known as its virtual address space. The virtual address space for
Linux processes is shown in Figure 1.13. (Other Unix systems use a similar layout.)
In Linux, the topmost region of the address space is reserved for code and data
in the operating system that is common to all processes. The lower region of the
address space holds the code and data defined by the user's process. Note that
addresses in the figure increase from the bottom to the top.

The virtual address space seen by each process consists of a number of
well-defined areas, each with a specific purpose. You will learn more about these areas
later in the book, but it will be helpful to look briefly at each, starting with the
lowest addresses and working our way up:


• Program code and data. Code begins at the same fixed address for all processes,
followed by data locations that correspond to global C variables. The code and
data areas are initialized directly from the contents of an executable object
file (in our case, the hello executable). You will learn more about this part of
the address space when we study linking and loading in Chapter 7.


• Heap. The code and data areas are followed immediately by the run-time heap.
Unlike the code and data areas, which are fixed in size once the process begins
running, the heap expands and contracts dynamically at run time as a result
of calls to C standard library routines such as malloc and free. We will study
heaps in detail when we learn about managing virtual memory in Chapter 9.


• Shared libraries. Near the middle of the address space is an area that holds the
code and data for shared libraries such as the C standard library and the math 
library. The notion of a shared library is a powerful but somewhat difficult 
concept. You will learn how they work when we study dynamic linking in 
Chapter 7. 


• Stack. At the top of the user's virtual address space is the user stack that
the compiler uses to implement function calls. Like the heap, the user stack 
expands and contracts dynamically during the execution of the program. In 
particular, each time we call a function, the stack grows. Each time we return 
from a function, it contracts. You will learn how the compiler uses the stack 
in Chapter 3. 


• Kernel virtual memory. The top region of the address space is reserved for the
kernel. Application programs are not allowed to read or write the contents of 
this area or to directly call functions defined in the kernel code. Instead, they 
must invoke the kernel to perform these operations. 


For virtual memory to work, a sophisticated interaction is required between 
the hardware and the operating system software, including a hardware translation 
of every address generated by the processor. The basic idea is to store the contents 
of a process’s virtual memory on disk and then use the main memory as a cache 
for the disk. Chapter 9 explains how this works and why it is so important to the 
operation of modern systems. 


1.7.4 Files 


A file is a sequence of bytes, nothing more and nothing less. Every I/O device, 
including disks, keyboards, displays, and even networks, is modeled as a file. All 
input and output in the system is performed by reading and writing files, using a 
small set of system calls known as Unix I/O. 

This simple and elegant notion of a file is nonetheless very powerful because 
it provides applications with a uniform view of all the varied I/O devices that 
might be contained in the system. For example, application programmers who 
manipulate the contents of a disk file are blissfully unaware of the specific disk 
technology. Further, the same program will run on different systems that use
different disk technologies. You will learn about Unix I/O in Chapter 10.


1.8 Systems Communicate with Other Systems Using Networks


Up to this point in our tour of systems, we have treated a system as an isolated 
collection of hardware and software. In practice, modern systems are often linked 
to other systems by networks. From the point of view of an individual system, the 
network can be viewed as just another I/O device, as shown in Figure 1.14. When
the system copies a sequence of bytes from main memory to the network adapter,
the data flow across the network to another machine, instead of, say, to a local
disk drive. Similarly, the system can read data sent from other machines and copy 
these data to its main memory. 

With the advent of global networks such as the Internet, copying information 
from one machine to another has become one of the most important uses of 
computer systems. For example, applications such as email, instant messaging, the 
World Wide Web, FTP, and telnet are all based on the ability to copy information 
over a network. 
Figure 1.14 A network is another I/O device.

Figure 1.15 Using telnet to run hello remotely over a network.


Returning to our hello example, we could use the familiar telnet application
to run hello on a remote machine. Suppose we use a telnet client running on our
local machine to connect to a telnet server on a remote machine. After we log in
to the remote machine and run a shell, the remote shell is waiting to receive an
input command. From this point, running the hello program remotely involves
the five basic steps shown in Figure 1.15.

After we type in the hello string to the telnet client and hit the enter key, 
the client sends the string to the telnet server. After the telnet server receives the
string from the network, it passes it along to the remote shell program. Next, the
remote shell runs the hello program and passes the output line back to the telnet 
server. Finally, the telnet server forwards the output string across the network to 
the telnet client, which prints the output string on our local terminal. 

This type of exchange between clients and servers is typical of all network
applications. In Chapter 11 you will learn how to build network applications and
apply this knowledge to build a simple Web server.
1.9 Important Themes

This concludes our initial whirlwind tour of systems. An important idea to take
away from this discussion is that a system is more than just hardware. It is a
collection of intertwined hardware and systems software that must cooperate in
order to achieve the ultimate goal of running application programs. The rest of
this book will fill in some details about the hardware and the software, and it will
show how, by knowing these details, you can write programs that are faster, more
reliable, and more secure.

To close out this chapter, we highlight several important concepts that cut
across all aspects of computer systems. We will discuss the importance of these
concepts at multiple places within the book.


1.9.1 Amdahl's Law


Gene Amdahl, one of the early pioneers in computing, made a simple but insightful
observation about the effectiveness of improving the performance of one part
of a system. This observation has come to be known as Amdahl's law. The main
idea is that when we speed up one part of a system, the effect on the overall
system performance depends on both how significant this part was and how much
it sped up. Consider a system in which executing some application requires time
$T_{\text{old}}$. Suppose some part of the system requires a fraction $\alpha$ of
this time, and that we improve its performance by a factor of $k$. That is, the
component originally required time $\alpha T_{\text{old}}$, and it now requires time
$(\alpha T_{\text{old}})/k$. The overall execution time would thus be

\[
\begin{aligned}
T_{\text{new}} &= (1 - \alpha)\,T_{\text{old}} + (\alpha T_{\text{old}})/k \\
               &= T_{\text{old}}\,[(1 - \alpha) + \alpha/k]
\end{aligned}
\]

From this, we can compute the speedup $S = T_{\text{old}}/T_{\text{new}}$ as

\[
S = \frac{1}{(1 - \alpha) + \alpha/k} \tag{1.1}
\]

As an example, consider the case where a part of the system that initially
consumed 60% of the time ($\alpha = 0.6$) is sped up by a factor of 3 ($k = 3$). Then
we get a speedup of $1/[0.4 + 0.6/3] = 1.67\times$. Even though we made a substantial
improvement to a major part of the system, our net speedup was significantly less
than the speedup for the one part. This is the major insight of Amdahl's law:
to significantly speed up the entire system, we must improve the speed of a very
large fraction of the overall system.





Practice Problem 1.1
Suppose you work as a truck driver, and you have been hired to carry a load of
potatoes from Boise, Idaho, to Minneapolis, Minnesota, a total distance of 2,500
kilometers. You estimate you can average 100 km/hr driving within the speed
limits, requiring a total of 25 hours for the trip.
A. You hear on the news that Montana has just abolished its speed limit, which
constitutes 1,500 km of the trip. Your truck can travel at 150 km/hr. What
will be your speedup for the trip?

B. You can buy a new turbocharger for your truck at www.fasttrucks.com. They
stock a variety of models, but the faster you want to go, the more it will cost.
How fast must you travel through Montana to get an overall speedup for
your trip of 1.67×?





Practice Problem 1.2
The marketing department at your company has promised your customers that
the next software release will show a 2× performance improvement. You have
been assigned the task of delivering on that promise. You have determined that
only 80% of the system can be improved. How much (i.e., what value of k) would
you need to improve this part to meet the overall performance target?


One interesting special case of Amdahl's law is to consider the effect of setting
$k$ to $\infty$. That is, we are able to take some part of the system and speed it up
to the point at which it takes a negligible amount of time. We then get

\[
S_{\infty} = \frac{1}{(1 - \alpha)} \tag{1.2}
\]

So, for example, if we can speed up 60% of the system to the point where it requires
close to no time, our net speedup will still only be $1/0.4 = 2.5\times$.

Amdahl's law describes a general principle for improving any process. In
addition to its application to speeding up computer systems, it can guide a company
trying to reduce the cost of manufacturing razor blades, or a student trying to
improve his or her grade point average. Perhaps it is most meaningful in the world
of computers, where we routinely improve performance by factors of 2 or more.
Such high factors can only be achieved by optimizing large parts of a system.


1.9.2 Concurrency and Parallelism


Throughout the history of digital computers, two demands have been constant
forces in driving improvements: we want them to do more, and we want them to
run faster. Both of these factors improve when the processor does more things at
once. We use the term concurrency to refer to the general concept of a system with
multiple, simultaneous activities, and the term parallelism to refer to the use of
concurrency to make a system run faster. Parallelism can be exploited at multiple
levels of abstraction in a computer system. We highlight three levels here, working
from the highest to the lowest level in the system hierarchy.


Thread-Level Concurrency

Building on the process abstraction, we are able to devise systems where multiple
programs execute at the same time, leading to concurrency. With threads, we
can even have multiple control flows executing within a single process. Support
for concurrent execution has been found in computer systems since the advent
of time-sharing in the early 1960s. Traditionally, this concurrent execution was
only simulated, by having a single computer rapidly switch among its executing
processes, much as a juggler keeps multiple balls flying through the air. This form
of concurrency allows multiple users to interact with a system at the same time,
such as when many people want to get pages from a single Web server. It also
allows a single user to engage in multiple tasks concurrently, such as having a
Web browser in one window, a word processor in another, and streaming music
playing at the same time. Until recently, most actual computing was done by a
single processor, even if that processor had to switch among multiple tasks. This
configuration is known as a uniprocessor system.
When we construct a system consisting of multiple processors all under the
control of a single operating system kernel, we have a multiprocessor system.
Such systems have been available for large-scale computing since the 1980s, but
they have more recently become commonplace with the advent of multi-core
processors and hyperthreading. Figure 1.16 shows a taxonomy of these different
processor types.

Multi-core processors have several CPUs (referred to as "cores") integrated
onto a single integrated-circuit chip. Figure 1.17 illustrates the organization of a
typical multi-core processor, where the chip has four CPU cores, each with its
own L1 and L2 caches, and with each L1 cache split into two parts: one to hold
recently fetched instructions and one to hold data. The cores share higher levels of
cache as well as the interface to main memory. Industry experts predict that they
will be able to have dozens, and ultimately hundreds, of cores on a single chip.

Figure 1.16 Categorizing different processor configurations. Multiprocessors are
becoming prevalent with the advent of multi-core processors and hyperthreading.

Figure 1.17 Multi-core processor organization. Four processor cores are integrated
onto a single chip.

Hyperthreading, sometimes called simultaneous multi-threading, is a technique
that allows a single CPU to execute multiple flows of control. It involves
having multiple copies of some of the CPU hardware, such as program counters
and register files, while having only single copies of other parts of the hardware,
such as the units that perform floating-point arithmetic. Whereas a conventional
processor requires around 20,000 clock cycles to shift between different threads,
a hyperthreaded processor decides which of its threads to execute on a
cycle-by-cycle basis. It enables the CPU to take better advantage of its processing
resources. For example, if one thread must wait for some data to be loaded into a
cache, the CPU can proceed with the execution of a different thread. As an example,
the Intel Core i7 processor can have each core executing two threads, and so a
four-core system can actually execute eight threads in parallel.

The use of multiprocessing can improve system performance in two ways.
First, it reduces the need to simulate concurrency when performing multiple tasks.
As mentioned, even a personal computer being used by a single person is expected
to perform many activities concurrently. Second, it can run a single application
program faster, but only if that program is expressed in terms of multiple threads
that can effectively execute in parallel. Thus, although the principles of concurrency
have been formulated and studied for over 50 years, the advent of multi-core
and hyperthreaded systems has greatly increased the desire to find ways to write
application programs that can exploit the thread-level parallelism available with
the hardware. Chapter 12 will look much more deeply into concurrency and its
use to provide a sharing of processing resources and to enable more parallelism
in program execution.




Instruction-Level Parallelism


At a much lower level of abstraction, modern processors can execute multiple
instructions at one time, a property known as instruction-level parallelism. For
example, early microprocessors, such as the 1978-vintage Intel 8086, required
multiple (typically 3-10) clock cycles to execute a single instruction. More recent
processors can sustain execution rates of 2-4 instructions per clock cycle. Any
given instruction requires much longer from start to finish, perhaps 20 cycles or
more, but the processor uses a number of clever tricks to process as many as 100
instructions at a time. In Chapter 4, we will explore the use of pipelining, where the
actions required to execute an instruction are partitioned into different steps and
the processor hardware is organized as a series of stages, each performing one
of these steps. The stages can operate in parallel, working on different parts of
different instructions. We will see that a fairly simple hardware design can sustain
an execution rate close to 1 instruction per clock cycle.

Processors that can sustain execution rates faster than 1 instruction per cycle
are known as superscalar processors. Most modern processors support superscalar
operation. In Chapter 5, we will describe a high-level model of such processors.
We will see that application programmers can use this model to understand the
performance of their programs. They can then write programs such that the
generated code achieves higher degrees of instruction-level parallelism and therefore
runs faster.


Single-Instruction, Multiple-Data (SIMD) Parallelism


At the lowest level, many modern processors have special hardware that allows
a single instruction to cause multiple operations to be performed in parallel, a
mode known as single-instruction, multiple-data (SIMD) parallelism. For example,
recent generations of Intel and AMD processors have instructions that can add 8
pairs of single-precision floating-point numbers (C data type float) in parallel.

These SIMD instructions are provided mostly to speed up applications that
process image, sound, and video data. Although some compilers attempt to
automatically extract SIMD parallelism from C programs, a more reliable method is to
write programs using special vector data types supported in compilers such as GCC.
We describe this style of programming in Web Aside OPT:SIMD, as a supplement to
the more general presentation on program optimization found in Chapter 5.


1.9.3 The Importance of Abstractions in Computer Systems


The use of abstractions is one of the most important concepts in computer science.
For example, one aspect of good programming practice is to formulate a simple
application program interface (API) for a set of functions that allow programmers
to use the code without having to delve into its inner workings. Different
programming languages provide different forms and levels of support for abstraction,
such as Java class declarations and C function prototypes.

We have already been introduced to several of the abstractions seen in
computer systems, as indicated in Figure 1.18. On the processor side, the instruction
set architecture provides an abstraction of the actual processor hardware. With this
abstraction, a machine-code program behaves as if it were executed on a
processor that performs just one instruction at a time. The underlying hardware is far
more elaborate, executing multiple instructions in parallel, but always in a way 
that is consistent with the simple, sequential model. By keeping the same execu- 
tion model, different processor implementations can execute the same machine 
code while offering a range of cost and performance. 

On the operating system side, we have introduced three abstractions: files as
an abstraction of I/O devices, virtual memory as an abstraction of program mem- 
ory, and processes as an abstraction of a running program. To these abstractions 
we add a new one: the virtual machine, providing an abstraction of the entire 
computer, including the operating system, the processor, and the programs. The 
idea of a virtual machine was introduced by IBM in the 1960s, but it has become 
more prominent recently as a way to manage computers that must be able to run 
programs designed for multiple operating systems (such as Microsoft Windows, 
Mac OS X, and Linux) or different versions of the same operating system. 

We will return to these abstractions in subsequent sections of the book. 


1.10 Summary 


A computer system consists of hardware and systems software that cooperate 
to run application programs. Information inside the computer is represented as
groups of bits that are interpreted in different ways, depending on the context. 
Programs are translated by other programs into different forms, beginning as 
ASCII text and then translated by compilers and linkers into binary executable 
files. 

Processors read and interpret binary instructions that are stored in main memory.
Since computers spend most of their time copying data between memory, I/O
devices, and the CPU registers, the storage devices in a system are arranged in a
hierarchy, with the CPU registers at the top, followed by multiple levels of hardware
cache memories, DRAM main memory, and disk storage. Storage devices that are
higher in the hierarchy are faster and more costly per bit than those lower in the
hierarchy. Storage devices that are higher in the hierarchy serve as caches for
devices that are lower in the hierarchy. Programmers can optimize the performance
of their C programs by understanding and exploiting the memory hierarchy.

The operating system kernel serves as an intermediary between the applica- 
tion and the hardware. It provides three fundamental abstractions: (1) Files are 
abstractions for I/O devices. (2) Virtual memory is an abstraction for both main 
memory and disks. (3) Processes are abstractions for the processor, main memory,
and I/O devices.

Finally, networks provide ways for computer systems to communicate with 
one another. From the viewpoint of a particular system, the network is just another
I/O device. 


Solutions to Practice Problems


Solution to Problem 1.1 (page 22) 
This problem illustrates that Amdahl's law applies to more than just computer
systems.


A. In terms of-Equation 1.1, we have œ = 0.6.and k = 1.5. Mare directly, travel- 
ing the 1,500 kilometers through Montana will require 10 hours, and the rest 
of the trip also requires 10 hours. This will give a speedup, of 25/(10 + 10) = 
1.25x. ^ 


B. In térms of Equation 1.1, we have a — 0.6, and we réquire S — 1.67, from 
which we can solve for k. More directly, to speed up the trip by 1.67x, we 
must decrease the overall time to 15 hours. The parts outside of Montana 
will still require 10 hours, so we must drive through Montana in 5 hours. 
This requires traveling at 300 km/hr, which is pretty fast for a truck! 


Solution to Problem 1.2 (page 23)

Amdahl's law is best understood by working through some examples. This one
requires you to look at the equation from an unusual perspective. This problem
is a simple application of the equation. You are given $S = 2$ and
$\alpha = 0.8$, and you must then solve for $k$:

$$2 = \frac{1}{(1 - 0.8) + 0.8/k}$$
$$0.4 + 1.6/k = 1.0$$
$$k = 2.67$$

Program Structure


Our exploration of computer systems starts by studying the computer itself,
comprising a processor and a memory subsystem. At


This part of the book will give you a deep understanding of how application
programs are represented and executed. You will gain skills that help you
write programs that are secure, reliable, and make the best
use of the computing resources.

Modern computers store and process information represented as two-valued
signals. These lowly binary digits, or bits, form the basis of the digital
revolution. The familiar decimal, or base-10, representation has been in use for
over 1,000 years, having been developed in India, improved by Arab
mathematicians in the 12th century, and brought to the West in the 13th century
by the Italian mathematician Leonardo Pisano (ca. 1170 to ca. 1250), better
known as Fibonacci. Using decimal notation is natural for 10-fingered humans,
but binary values work better when building machines that store and process
information. Two-valued signals can readily be represented, stored, and
transmitted: for example, as the presence or absence of a hole in a punched
card, as a high or low voltage on a wire, or as a magnetic domain oriented
clockwise or counterclockwise. The electronic circuitry for storing and
performing computations on two-valued signals is very simple and reliable,
enabling manufacturers to integrate millions, or even billions, of such
circuits on a single silicon chip.

In isolation, a single bit is not very useful. When we group bits together and
apply some interpretation that gives meaning to the different possible bit patterns, 
however, we can represent the elements of any finite set. For example, using a 
binary number system, we can use groups of bits to encode nonnegative numbers. 
By using a standard character code, we can encode the letters and symbols in a 
document. We cover both of these encodings in this chapter, as well as encodings 
to represent negative numbers and to approximate real numbers.

We consider the three most important representations of numbers. Unsigned 
encodings are based on traditional binary notation, representing numbers greater 
than or equal to 0. Two's-complement encodings are the most common way to
represent signed integers, that is, numbers that may be either positive or negative. 
Floating-point encodings are a base-2 version of scientific notation for represent- 
ing real numbers. Computers implement arithmetic operations, such as addition 
and multiplication, with these different representations, similar to the correspond- 
ing operations on integers and real numbers. 

Computer representations use a limited number of bits to encode a number,
and hence some operations can overflow when the results are too large to be
represented. This can lead to some surprising results. For example, on most of
today's computers (those using a 32-bit representation for data type int),
computing the expression

200 * 300 * 400 * 500

yields -884,901,888. This runs counter to the properties of integer arithmetic:
computing the product of a set of positive numbers has yielded a negative
result.

On the other hand, integer computer arithmetic satisfies many of the familiar
properties of true integer arithmetic. For example, multiplication is
associative and commutative, so that computing any of the following C
expressions yields -884,901,888:

(500 * 400) * (300 * 200)
((500 * 400) * 300) * 200
((200 * 500) * 300) * 400
400 * (200 * (300 * 500))

The computer might not generate the expected result, but at least it is
consistent!

Floating-point arithmetic has altogether different mathematical properties.
The product of a set of positive numbers will always be positive, although
overflow will yield the special value $+\infty$. Floating-point arithmetic is
not associative due to the finite precision of the representation. For example,
the C expression (3.14+1e20)-1e20 will evaluate to 0.0 on most machines, while
3.14+(1e20-1e20) will evaluate to 3.14. The different mathematical properties of
integer versus floating-point arithmetic stem from the difference in how they
handle the finiteness of their representations: integer representations can
encode a comparatively small range of values, but do so precisely, while
floating-point representations can encode a wide range of values, but only
approximately.

By studying the actual number representations, we can understand the ranges
of values that can be represented and the properties of the different arithmetic
operations. This understanding is critical to writing programs that work correctly
over the full range of numeric values and that are portable across different
combinations of machine, operating system, and compiler. As we will describe, a number
of computer security vulnerabilities have arisen due to some of the subtleties of 
computer arithmetic. Whereas in an earlier era program bugs would only incon- 
venience people when they happened to be triggered, there are now legions of 
hackers who try to exploit any bug they can find to obtain unauthorized access 
to other people's systems. This puts a higher level of obligation on programmers 
to understand how their programs work and how they can be made to behave in 
undesirable ways. 

Computers use several different binary representations to encode numeric 
values. You will need to be familiar with these representations as you progress
into machine-level programming in Chapter 3. We describe these encodings in 
this chapter and show you how to reason about number representations. 

We derive several ways to perform arithmetic operations by directly
manipulating the bit-level representations of numbers. Understanding these
techniques will be important for understanding the machine-level code generated
by compilers in their attempt to optimize the performance of arithmetic
expression evaluation.

Our treatment of this material is based on a core set of mathematical prin- 
ciples. We start with the basic definitions of the encodings and then derive such 
properties as the range of representable numbers, their bit-level representations, 
and the properties of the arithmetic operations. We believe it is important for you 
to examine the material from this abstract viewpoint, because programmers need 
to have a clear understanding of how computer arithmetic relates to the more
familiar integer and real arithmetic. 

The C++ programming language is built upon C, using the exact same numeric 
representations and operations. Everything said in this chapter about C also holds 
for C++. The Java language definition, on the other hand, created a new set of
standards for numeric representations and operations. Whereas the C standards 
are designed to allow a wide range of implementations, the Java standard is quite 
specific on the formats and encodings of data. We highlight the representations 
and operations supported by Java at several places in the chapter.

2.1 Information Storage


Rather than accessing individual bits in memory, most computers use blocks of
8 bits, or bytes, as the smallest addressable unit of memory. A machine-level
program views memory as a very large array of bytes, referred to as virtual
memory. Every byte of memory is identified by a unique number, known as its
address, and the set of all possible addresses is known as the virtual address
space. As indicated by its name, this virtual address space is just a conceptual
image presented to the machine-level program. The actual implementation
(presented in Chapter 9) uses a combination of dynamic random access memory
(DRAM), flash memory, disk storage, special hardware, and operating system
software to provide the program with what appears to be a monolithic byte array.

In subsequent chapters, we will cover how the compiler and run-time system
partition this memory space into more manageable units to store the different
program objects, that is, program data, instructions, and control information.
Various mechanisms are used to allocate and manage the storage for different
parts of the program. This management is all performed within the virtual address
space. For example, the value of a pointer in C (whether it points to an integer,
a structure, or some other program object) is the virtual address of the first
byte of some block of storage. The C compiler also associates type information
with each pointer, so that it can generate different machine-level code to access
the value stored at the location designated by the pointer depending on the type
of that value. Although the C compiler maintains this type information, the
actual machine-level program it generates has no information about data types.
It simply treats each program object as a block of bytes and the program itself
as a sequence
of bytes.

Aside The evolution of the C programming language

As was described in an aside on page 4, the C programming language was first
developed by Dennis Ritchie of Bell Laboratories for use with the Unix operating
system (also developed at Bell Labs). At the time, most system programs, such as
operating systems, had to be written largely in assembly code in order to have
access to the low-level representations of different data types. For example, it
was not feasible to write a memory allocator, such as is provided by the malloc
library function, in other high-level languages of that era.

The original Bell Labs version of C was documented in the first edition of the
book by Brian Kernighan and Dennis Ritchie [60]. Over time, C has evolved through
the efforts of several standardization groups. The first major revision of the
original Bell Labs C led to the ANSI C standard in 1989, by a group working under
the auspices of the American National Standards Institute. ANSI C was a major
departure from Bell Labs C, especially in the way functions are declared. ANSI C
is described in the second edition of Kernighan and Ritchie's book [61], which is
still considered one of the best references on C.

The International Standards Organization took over responsibility for
standardizing the C language, adopting a version that was substantially the same
as ANSI C in 1990 and hence is referred to as "ISO C90."

This same organization sponsored an updating of the language in 1999, yielding
"ISO C99." Among other things, this version introduced some new data types and
provided support for text strings requiring characters not found in the English
language. A more recent standard was approved in 2011, and hence is named
"ISO C11," again adding more data types and features. Most of these recent
additions have been backward compatible, meaning that programs written according
to the earlier standard (at least as far back as ISO C90) will have the same
behavior when compiled according to the newer standards.

The GNU Compiler Collection (gcc) can compile programs according to the
conventions of several different versions of the C language, based on different
command-line options, as shown in Figure 2.1. For example, to compile program
prog.c according to ISO C11, we could give the command line

linux> gcc -std=c11 prog.c

The options -ansi and -std=c89 have identical effect: the code is compiled
according to the ANSI or ISO C90 standard. (C90 is sometimes referred to as
"C89," since its standardization effort began in 1989.) The option -std=c99
causes the compiler to follow the ISO C99 convention.

As of the writing of this book, when no option is specified, the program will be
compiled according to a version of C based on ISO C90, but including some
features of C99, some of C11, some of C++, and others specific to gcc. The GNU
project is developing a version that combines ISO C11, plus other features, that
can be specified with the command-line option -std=gnu11. (Currently, this
implementation is incomplete.) This will become the default version.

C version        gcc command-line option
GNU 89           none, -std=gnu89
ANSI, ISO C90    -ansi, -std=c89
ISO C99          -std=c99
ISO C11          -std=c11

Figure 2.1 Specifying different versions of C to gcc.

Pointers are a central feature of C. They provide the mechanism for referencing
elements of data structures, including arrays. Just like a variable, a pointer
has two aspects: its value and its type. The value indicates the location of
some object, while its type indicates what kind of object (e.g., integer or
floating-point number) is stored at that location.
Truly understanding pointers requires examining their representation and
implementation at the machine level. This will be a major focus in Chapter 3,
culminating in an in-depth presentation in Section
3.10.1.

2.1.1 Hexadecimal Notation


A single byte consists of 8 bits. In binary notation, its value ranges from
$00000000_2$ to $11111111_2$. When viewed as a decimal integer, its value ranges
from $0_{10}$ to $255_{10}$. Neither notation is very convenient for describing
bit patterns. Binary notation is too verbose, while with decimal notation it is
tedious to convert to and from bit patterns. Instead, we write bit patterns as
base-16, or hexadecimal numbers. Hexadecimal (or simply "hex") uses digits '0'
through '9' along with characters 'A' through 'F' to represent 16 possible
values. Figure 2.2 shows the decimal and binary values associated with the 16
hexadecimal digits. Written in hexadecimal, the value of a single byte can range
from $00_{16}$ to $\mathrm{FF}_{16}$.

In C, numeric constants starting with 0x or 0X are interpreted as being in
hexadecimal. The characters 'A' through 'F' may be written in either upper- or
lowercase. For example, we could write the number $\mathrm{FA1D37B}_{16}$ as
0xFA1D37B, as 0xfa1d37b, or even mixing upper- and lowercase (e.g., 0xFa1D37b).
We will use the C notation for representing hexadecimal values in this book.

A common task in working with machine-level programs is to manually convert
between decimal, binary, and hexadecimal representations of bit patterns.
Converting between binary and hexadecimal is straightforward, since it can be
performed one hexadecimal digit at a time. Digits can be converted by referring
to a chart such as that shown in Figure 2.2. One simple trick for doing the
conversion in your head is to memorize the decimal equivalents of hex digits A,
C, and F. The hex values B, D, and E can be translated to decimal by computing
their values relative to the first three.


Hex digit        0     1     2     3     4     5     6     7
Decimal value    0     1     2     3     4     5     6     7
Binary value     0000  0001  0010  0011  0100  0101  0110  0111

Hex digit        8     9     A     B     C     D     E     F
Decimal value    8     9     10    11    12    13    14    15
Binary value     1000  1001  1010  1011  1100  1101  1110  1111


Figure 2.2 Hexadecimal notation. Each hex digit encodes one of 16 values.

For example, suppose you are given the number 0x173A4C. You can convert 
this to binary format by expanding each hexadecimal digit, as follows: 


Hexadecimal    1     7     3     A     4     C
Binary         0001  0111  0011  1010  0100  1100


This gives the binary representation 000101110011101001001100.

Conversely, given a binary number 1111001010110110110011, you convert it
to hexadecimal by first splitting it into groups of 4 bits each. Note, however,
that if the total number of bits is not a multiple of 4, you should make the
leftmost group be the one with fewer than 4 bits, effectively padding the number
with leading zeros. Then you translate each group of bits into the corresponding
hexadecimal digit:


Binary         11    1100  1010  1101  1011  0011
Hexadecimal    3     C     A     D     B     3





Perform the following number conversions:


A. 0x39A7F8 to binary
B. binary 1100100101111011 to hexadecimal
C. 0xD5E4C to binary
D. binary 1001101110011110110101 to hexadecimal


When a value $x$ is a power of 2, that is, $x = 2^n$ for some nonnegative
integer $n$, we can readily write $x$ in hexadecimal form by remembering that
the binary representation of $x$ is simply 1 followed by $n$ zeros. The
hexadecimal digit 0 represents 4 binary zeros. So, for $n$ written in the form
$i + 4j$, where $0 \le i \le 3$, we can write $x$ with a leading hex digit of
1 ($i = 0$), 2 ($i = 1$), 4 ($i = 2$), or 8 ($i = 3$), followed by $j$
hexadecimal 0s. As an example, for $x = 2{,}048 = 2^{11}$, we have
$n = 11 = 3 + 4 \cdot 2$, giving hexadecimal representation 0x800.





Fill in the blank entries in the following table, giving the decimal and
hexadecimal representations of different powers of 2:

n         2^n (decimal)    2^n (hexadecimal)

9         512              0x200
19        ______           ______
______    16,384           ______
______    ______           0x10000
17        ______           ______
______    32               ______
______    ______           0x80

Converting between decimal and hexadecimal representations requires using
multiplication or division to handle the general case. To convert a decimal
number $x$ to hexadecimal, we can repeatedly divide $x$ by 16, giving a quotient
$q$ and a remainder $r$, such that $x = q \cdot 16 + r$. We then use the
hexadecimal digit representing $r$ as the least significant digit and generate
the remaining digits by repeating the process on $q$. As an example, consider
the conversion of decimal 314,156:


314,156 = 19,634 · 16 + 12    (C)
 19,634 =  1,227 · 16 +  2    (2)
  1,227 =     76 · 16 + 11    (B)
     76 =      4 · 16 + 12    (C)
      4 =      0 · 16 +  4    (4)


From this we can read off the hexadecimal representation as 0x4CB2C.

Conversely, to convert a hexadecimal number to decimal, we can multiply
each of the hexadecimal digits by the appropriate power of 16. For example, given
the number 0x7AF, we compute its decimal equivalent as
$7 \cdot 16^2 + 10 \cdot 16 + 15 = 7 \cdot 256 + 10 \cdot 16 + 15 =
1{,}792 + 160 + 15 = 1{,}967$.





A single byte can be represented by 2 hexadecimal digits. Fill in the missing
entries in the following table, giving the decimal, binary, and hexadecimal
values of different byte patterns:

Decimal    Binary       Hexadecimal

0          0000 0000    0x00
167        _________    ____
62         _________    ____
188        _________    ____
___        0011 0111    ____
___        1000 1000    ____
___        1111 0011    ____
___        _________    0xAC
___        _________    0xE7

For converting larger values between decimal and hexadecimal, it is best to let
a computer or calculator do the work. There are numerous tools that can do this.
One simple way is to use any of the standard





Without converting the numbers to decimal or binary, try to solve the following
arithmetic problems, giving the answers in hexadecimal. Hint: Just modify the
methods you use for performing decimal addition and subtraction to use base 16.


A. 0x503c + 0x8 =
B. 0x503c - 0x40 =
C. 0x503c + 64 =
D. 0x50ea - 0x503c =


2.1.2 Data Sizes 


Every computer has a word size, indicating the nominal size of pointer data.
Since a virtual address is encoded by such a word, the most important system
parameter determined by the word size is the maximum size of the virtual address
space. That is, for a machine with a $w$-bit word size, the virtual addresses
can range from 0 to $2^w - 1$, giving the program access to at most $2^w$ bytes.

In recent years, there has been a widespread shift from machines with 32- 
bit word sizes to those with word sizes of 64 bits. This occurred first for high-end 
machines designed for large-scale scientific and database applications, followed 
by desktop and laptop machines, and most recently for the processors found in 
smartphones. A 32-bit word size limits the virtual address space to 4 gigabytes 
(written 4 GB), that is, just over $4 \times 10^9$ bytes. Scaling up to a 64-bit word size
leads to a virtual address space of 16 exabytes, or around $1.84 \times 10^{19}$ bytes.

Most 64-bit machines can also run programs compiled for use on 32-bit machines,
a form of backward compatibility. So, for example, when a program prog.c is
compiled with the directive


linux> gcc -m32 prog.c


then this program will run correctly on either a 32-bit or a 64-bit machine. On the
other hand, a program compiled with the directive


linux> gcc -m64 prog.c


will only run on a 64-bit machine. We will therefore refer to programs as being
either "32-bit programs" or "64-bit programs," since the distinction lies in how a
program is compiled, rather than the type of machine on which it runs.

Computers and compilers support multiple data formats using different ways
to encode data, such as integers and floating point, as well as different lengths.
For example, many machines have instructions for manipulating single bytes, as
well as integers represented as 2-, 4-, and 8-byte quantities. They also support
floating-point numbers represented as 4- and 8-byte quantities.

The C language supports multiple data formats for both integer and
floating-point data. Figure 2.3 shows the number of bytes typically allocated
for different C data types. (We discuss the relation between what is guaranteed
by the C standard versus what is typical in Section 2.2.) The exact numbers of
bytes for some data types depend on how the program is compiled. We show sizes
for typical 32-bit and 64-bit programs. Integer data can be either signed, able
to represent negative, zero, and positive values, or unsigned, only allowing
nonnegative values. Data type char represents a single byte. Although the name
char derives from the fact that it is used to store a single character in a text
string, it can also be used to store integer values. Data types short, int, and
long are intended to provide a range of sizes. Even when compiled for 64-bit
systems, data type int is usually just 4 bytes. Data type long commonly has
4 bytes in 32-bit programs and 8 bytes in 64-bit programs.


C declaration                        Bytes
Signed           Unsigned            32-bit    64-bit

[signed] char    unsigned char       1         1
short            unsigned short      2         2
int              unsigned            4         4
long             unsigned long       4         8
int32_t          uint32_t            4         4
int64_t          uint64_t            8         8
char *                               4         8
float                                4         4
double                               8         8


Figure 2.3 Typical sizes (in bytes) of basic C data types. The number of bytes
allocated varies with how the program is compiled. This chart shows the values
typical of 32-bit and 64-bit programs.

To avoid the vagaries of relying on “typical” sizes and different compiler set- 
tings, ISO C99 introduced a class of data types where the data sizes are fixed 
regardless of compiler and machine settings. Among these are data types int32_t 
and int64_t, having exactly 4 and 8 bytes, respectively. Using fixed-size integer 
types is the best way for programmers to have close control over data represen- 
tations. 

Most of the data types encode signed values, unless prefixed by the keyword 
unsigned or using the specific unsigned declaration for fixed-size data types. The 
exception to this is data type char. Although most compilers and machines treat 
these as signed data, the C standard does not guarantee this. Instead, as indicated
by the square brackets, the programmer should use the declaration signed char 
to guarantee a 1-byte signed value. In many contexts, however, the program's
behavior is insensitive to whether data type char is signed or unsigned. 

The C language allows a variety of ways to order the keywords and to include
or omit optional keywords. As examples, all of the following declarations have
identical meaning: 


unsigned long 
unsigned long int 
long unsigned 


long unsigned int 


We will consistently use the forms found in Figure 2.3. 

Figure 2.3 also shows that a pointer (e.g., a variable declared as being of 
type char *) uses the full word size of the program. Most machines also support 
two different floating-point formats: single precision, declared in C as float, 
and double precision, declared in C as double. These formats use 4 and 8 bytes, 
respectively. 

Programmers should strive to make their programs portable across different
machines and compilers. One aspect of portability is to make the program
insensitive to the exact sizes of the different data types. The C standards set
lower bounds on the numeric ranges of the different data types, as will be
covered later, but there are no upper bounds (except with the fixed-size types).
With 32-bit machines and 32-bit programs being the dominant combination from
around 1980 until around 2010, many programs have been written assuming the
allocations listed for 32-bit programs in Figure 2.3. With the transition to
64-bit machines, many hidden word size dependencies have arisen as bugs in
migrating these programs to new machines. For example, many programmers
historically assumed that an object declared as type int could be used to store
a pointer. This works fine for most 32-bit programs, but it leads to problems
for 64-bit programs.


2.1.3 Addressing and Byte Ordering 


For program objects that span multiple bytes, we must establish two conventions:
what the address of the object will be, and how we will order the bytes in memory.
In virtually all machines, a multi-byte object is stored as a contiguous sequence
of bytes, with the address of the object given by the smallest address of the bytes
used. For example, suppose a variable x of type int has address 0x100; that is, the
value of the address expression &x is 0x100. Then (assuming data type int has a
32-bit representation) the 4 bytes of x would be stored in memory locations 0x100,
0x101, 0x102, and 0x103.

For ordering the bytes representing an object, there are two common
conventions. Consider a $w$-bit integer having a bit representation
$[x_{w-1}, x_{w-2}, \ldots, x_1, x_0]$, where $x_{w-1}$ is the most significant
bit and $x_0$ is the least. Assuming $w$ is a multiple of 8, these bits can be
grouped as bytes, with the most significant byte having bits
$[x_{w-1}, x_{w-2}, \ldots, x_{w-8}]$, the least significant byte having bits
$[x_7, x_6, \ldots, x_0]$, and the other bytes having bits from the middle. Some
machines choose to store the object in memory ordered from least significant
byte to most, while other machines store them from most to least. The former
convention, where the least significant byte comes first, is referred to as
little endian. The latter convention, where the most significant byte comes
first, is referred to as big endian.

Suppose the variable x of type int and at address 0x100 has a hexadecimal
value of 0x01234567. The ordering of the bytes within the address range 0x100
through 0x103 depends on the type of machine:


Big endian
Address    0x100    0x101    0x102    0x103
Value      01       23       45       67

Little endian
Address    0x100    0x101    0x102    0x103
Value      67       45       23       01


Note that in the word 0x01234567 the high-order byte has hexadecimal value
0x01, while the low-order byte has value 0x67.

Most Intel-compatible machines operate exclusively in little-endian mode. On
the other hand, most machines from IBM and Oracle (arising from their
acquisition of Sun Microsystems in 2010) operate in big-endian mode. Note that
we said “most.” The conventions do not split precisely along corporate
boundaries. For example, both IBM and Oracle manufacture machines that use
Intel-compatible processors and hence are little endian. Many recent
microprocessor chips are bi-endian, meaning that they can be configured to
operate as either little- or big-endian machines. In practice, however, byte
ordering becomes fixed once a particular operating system is chosen. For
example, ARM microprocessors, used in many cell phones, have hardware that can
operate in either little- or big-endian mode, but the two most common operating
systems for these chips, Android (from Google) and iOS (from Apple), operate
only in little-endian mode.

Aside Origin of “endian”

Here is how Jonathan Swift, writing in 1726, described the history of the
controversy between big and little endians:

    ... Lilliput and Blefuscu . . . have, as I was going to tell you, been
    engaged in a most obstinate war for six-and-thirty moons past. It began
    upon the following occasion. It is allowed on all hands, that the primitive
    way of breaking eggs, before we eat them, was upon the larger end; but his
    present majesty's grandfather, while he was a boy, going to eat an egg, and
    breaking it according to the ancient practice, happened to cut one of his
    fingers. Whereupon the emperor his father published an edict, commanding
    all his subjects, upon great penalties, to break the smaller end of their
    eggs. The people so highly resented this law, that our histories tell us,
    there have been six rebellions raised on that account; wherein one emperor
    lost his life, and another his crown. These civil commotions were
    constantly fomented by the monarchs of Blefuscu; and when they were
    quelled, the exiles always fled for refuge to that empire. It is computed
    that eleven thousand persons have at several times suffered death, rather
    than submit to break their eggs at the smaller end. Many hundred large
    volumes have been published upon this controversy: but the books of the
    Big-endians have been long forbidden, and the whole party rendered
    incapable by law of holding employments.
    (Jonathan Swift. Gulliver's Travels, Benjamin Motte, 1726.)

In his day, Swift was satirizing the continued conflicts between England
(Lilliput) and France (Blefuscu). Danny Cohen, an early pioneer in networking
protocols, first applied these terms to refer to byte ordering [24], and the
terminology has been widely adopted.

People get surprisingly emotional about which byte ordering is the proper one. 
In fact, the terms “little endian” and “big endian” come from the book Gulliver's 
Travels by Jonathan Swift, where two warring factions could not agree as to how a 
soft-boiled egg should be opened—by the little end or by the big. Just like the egg 
issue, there is no technological reason to choose one byte ordering convention over 
the other, and hence the arguments degenerate into bickering about sociopolitical 
issues. As long as one of the conventions is selected and adhered to consistently, 
the choice is arbitrary. 

For most application programmers, the byte orderings used by their machines 
are totally invisible; programs compiled for either class of machine give identical 
results. At times, however, byte ordering becomes an issue. The first is when 
binary data are communicated over a network between different machines. A 
common problem is for data produced by a little-endian machine to be sent to 
a big-endian machine, or vice versa, leading to the bytes within the words being 
in reverse order for the receiving program. To avoid such problems, code written 
for networking applications must follow established conventions for byte ordering 
to make sure the sending machine converts its internal representation to the 
network standard, while the receiving machine converts the network standard to 
its internal representation. We will see examples of these conversions in Chapter 11. 

A second case where byte ordering becomes important is when looking at 
the byte sequences representing integer data. This occurs often when inspecting 
machine-level programs. As an example, the following line occurs in a file that 
gives a text representation of the machine-level code for an Intel x86-64 processor: 

    400443: 01 05 43 0b 20 00    add    %eax,0x200b43(%rip) 


This line was generated by a disassembler, a tool that determines the instruction 
sequence represented by an executable program file. We will learn more about 
disassemblers and how to interpret lines such as this in Chapter 3. For now, we 
simply note that this line states that the hexadecimal byte sequence 01 05 43 0b 
20 00 is the byte-level representation of an instruction that adds a word of data 
to the value stored at an address computed by adding 0x200b43 to the current 
value of the program counter, the address of the next instruction to be executed. 
If we take the final 4 bytes of the sequence, 43 0b 20 00, and write them in reverse 
order, we have 00 20 0b 43. Dropping the leading 0, we have the value 0x200b43, 
the numeric value written on the right. Having bytes appear in reverse order 
is a common occurrence when reading machine-level program representations 
generated for little-endian machines such as this one. The natural way to write a 
byte sequence is to have the lowest-numbered byte on the left and the highest on 
the right, but this is contrary to the normal way of writing numbers with the most 
significant digit on the left and the least on the right. 
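
To make the reversal concrete, here is a minimal sketch in C that lays out a 32-bit value in little-endian order by hand. The helper names store_le32 and load_le32 are hypothetical, introduced only for illustration. The code uses shifts and masks (covered later in this chapter), so its result does not depend on the byte order of the machine running it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a 32-bit value into buf in little-endian
   order (least significant byte first), whatever the host byte order. */
void store_le32(uint32_t value, unsigned char buf[4]) {
    for (size_t i = 0; i < 4; i++)
        buf[i] = (unsigned char) ((value >> (8 * i)) & 0xFF);
}

/* Hypothetical helper: rebuild the value from little-endian bytes.
   Byte i contributes bits 8*i through 8*i + 7. */
uint32_t load_le32(const unsigned char buf[4]) {
    uint32_t value = 0;
    for (size_t i = 0; i < 4; i++)
        value |= (uint32_t) buf[i] << (8 * i);
    return value;
}
```

Calling store_le32 with the value 0x200b43 fills the buffer with the bytes 43 0b 20 00, the same order in which they appear in the disassembled instruction.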

A third case where byte ordering becomes visible is when programs are 
written that circumvent the normal type system. In the C language, this can be 
done using a cast or a union to allow an object to be referenced according to 
a data type different from the one with which it was created. Such coding tricks are strongly 
discouraged for most application programming, but they can be quite useful and 
even necessary for system-level programming. 

Figure 2.4 shows C code that uses casting to access and print the byte representations 
of different program objects. We use typedef to define data type 
byte_pointer as a pointer to an object of type unsigned char. Such a byte pointer 
references a sequence of bytes where each byte is considered to be a nonnegative 
integer. The first routine, show_bytes, is given the address of a sequence of 
bytes, indicated by a byte pointer, and a byte count. The byte count is specified as 
having data type size_t, the preferred data type for expressing the sizes of data 
structures. It prints the individual bytes in hexadecimal. The C formatting directive 
%.2x indicates that an integer should be printed in hexadecimal with at least 
2 digits. 
 1  #include <stdio.h>
 2
 3  typedef unsigned char *byte_pointer;
 4
 5  void show_bytes(byte_pointer start, size_t len) {
 6      int i;
 7      for (i = 0; i < len; i++)
 8          printf(" %.2x", start[i]);
 9      printf("\n");
10  }
11
12  void show_int(int x) {
13      show_bytes((byte_pointer) &x, sizeof(int));
14  }
15
16  void show_float(float x) {
17      show_bytes((byte_pointer) &x, sizeof(float));
18  }
19
20  void show_pointer(void *x) {
21      show_bytes((byte_pointer) &x, sizeof(void *));
22  }

Figure 2.4 Code to print the byte representation of program objects. This code 
uses casting to circumvent the type system. Similar functions are easily defined for other 
data types. 


Procedures show_int, show_float, and show_pointer demonstrate how to 
use procedure show_bytes to print the byte representations of C program objects 
of type int, float, and void *, respectively. Observe that they simply pass show_bytes 
a pointer &x to their argument x, casting the pointer to be of type unsigned 
char *. This cast indicates to the compiler that the program should consider the 
pointer to be to a sequence of bytes rather than to an object of the original data 
type. This pointer will then be to the lowest byte address occupied by the object. 

These procedures use the C sizeof operator to determine the number of bytes 
used by the object. In general, the expression sizeof(T) returns the number of 
bytes required to store an object of type T. Using sizeof rather than a fixed value 
is one step toward writing code that is portable across different machine types. 

We ran the code shown in Figure 2.5 on several different machines, giving the 
results shown in Figure 2.6. The following machines were used: 


Linux 32    Intel IA32 processor running Linux. 

Windows     Intel IA32 processor running Windows. 

Sun         Sun Microsystems SPARC processor running Solaris. (These machines 
            are now produced by Oracle.) 

Linux 64    Intel x86-64 processor running Linux. 
code/data/show-bytes.c 

    void test_show_bytes(int val) {
        int ival = val;
        float fval = (float) ival;
        int *pval = &ival;
        show_int(ival);
        show_float(fval);
        show_pointer(pval);
    }

code/data/show-bytes.c 


Figure 2.5 Byte representation examples. This code prints the byte representations 
of sample data objects. 





Machine     Value       Type      Bytes (hex) 

Linux 32    12,345      int       39 30 00 00 
Windows     12,345      int       39 30 00 00 
Sun         12,345      int       00 00 30 39 
Linux 64    12,345      int       39 30 00 00 
Linux 32    12,345.0    float     00 e4 40 46 
Windows     12,345.0    float     00 e4 40 46 
Sun         12,345.0    float     46 40 e4 00 
Linux 64    12,345.0    float     00 e4 40 46 
Linux 32    &ival       int *     e4 f9 ff bf 
Windows     &ival       int *     b4 cc 22 00 
Sun         &ival       int *     ef ff fa 0c 
Linux 64    &ival       int *     b8 11 e5 ff ff 7f 00 00 





Figure 2.6 Byte representations of different data values. Results for int and float 
are identical, except for byte ordering. Pointer values are machine dependent. 




Our argument 12,345 has hexadecimal representation 0x00003039. For the int 
data, we get identical results for all machines, except for the byte ordering. In 
particular, we can see that the least significant byte value of 0x39 is printed first 
for Linux 32, Windows, and Linux 64, indicating little-endian machines, and last 
for Sun, indicating a big-endian machine. Similarly, the bytes of the float data 
are identical, except for the byte ordering. On the other hand, the pointer values 
are completely different. The different machine/operating system configurations 
use different conventions for storage allocation. One feature to note is that the 
Linux 32, Windows, and Sun machines use 4-byte addresses, while the Linux 64 
machine uses 8-byte addresses. 
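
The observation that the byte 0x39 comes first on some machines and last on others suggests a simple run-time check. The sketch below uses the same casting technique as Figure 2.4; the function name is_little_endian is an assumption made for this illustration and does not appear in the figures.

```c
/* Hypothetical helper: return 1 when run on a little-endian machine
   and 0 when run on a big-endian machine. */
int is_little_endian(void) {
    unsigned int one = 1;
    /* View the int as a sequence of bytes, as show_bytes does. */
    unsigned char *first = (unsigned char *) &one;
    /* On a little-endian machine the least significant byte, which
       holds the value 1, sits at the lowest address. */
    return first[0] == 1;
}
```

Of the machines listed above, Linux 32, Windows, and Linux 64 would be expected to return 1, and Sun would be expected to return 0.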
New to C? Naming data types with typedef 

The typedef declaration in C provides a way of giving a name to a data type. This can be a great help 
in improving code readability, since deeply nested type declarations can be difficult to decipher. 
The syntax for typedef is exactly like that of declaring a variable, except that it uses a type name 
rather than a variable name. Thus, the declaration of byte_pointer in Figure 2.4 has the same form as 
the declaration of a variable of type unsigned char *. 
For example, the declaration 

    typedef int *int_pointer; 
    int_pointer ip; 

defines type int_pointer to be a pointer to an int, and declares a variable ip of this type. Alternatively, 
we could declare this variable directly as 

    int *ip; 
New to C? Formatted printing with printf 

The printf function (along with its cousins fprintf and sprintf) provides a way to print information 
with considerable control over the formatting details. The first argument is a format string, while any 
remaining arguments are values to be printed. Within the format string, each character sequence 
starting with '%' indicates how to format the next argument. Typical examples include %d to print a 
decimal integer, %f to print a floating-point number, and %c to print a character having the character 
code given by the argument. 
Specifying the formatting of fixed-size data types, such as int32_t, is a bit more involved, as is 
described in the aside on page 67. 
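
As a small illustration of these directives, the sketch below formats values into a character buffer with snprintf, a standard relative of printf that writes to a buffer of known size. The function names format_byte and format_code are hypothetical and exist only for this example; the second one assumes an ASCII character code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format one byte the way show_bytes prints it,
   as a space followed by at least two hexadecimal digits. */
int format_byte(char *out, size_t size, unsigned char b) {
    return snprintf(out, size, " %.2x", (unsigned int) b);
}

/* Hypothetical helper: format a character code in decimal with %d,
   followed by the character having that code with %c. */
int format_code(char *out, size_t size, int code) {
    return snprintf(out, size, "%d=%c", code, code);
}
```

With these definitions, format_byte applied to the value 5 produces the text " 05", showing how %.2x pads a single hexadecimal digit to two.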
Observe that although the floating-point and the integer data both encode 
the numeric value 12,345, they have very different byte patterns: 0x00003039 
for the integer and 0x4640E400 for floating point. In general, these two formats 
use different encoding schemes. If we expand these hexadecimal patterns into 
binary form and shift them appropriately, we find a sequence of 13 matching bits, 
indicated by a sequence of asterisks, as follows: 


    0   0   0   0   3   0   3   9
    00000000000000000011000000111001
                       *************
              4   6   4   0   E   4   0   0
              01000110010000001110010000000000


This is not coincidental. We will return to this example when we study floating- 
point formats. 
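
The alignment can also be checked mechanically. The following sketch assumes that float is a 32-bit IEEE 754 single-precision value, which holds on the four machines used here but is not guaranteed by C. The function names are hypothetical; memcpy is used rather than a pointer cast so that the copy is well defined.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the bit pattern of a float as a 32-bit
   unsigned integer. Assumes a 32-bit IEEE 754 float. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* The low-order 13 bits of an integer bit pattern. */
uint32_t int_low13(uint32_t ibits) {
    return ibits & 0x1FFF;
}

/* The 13 bits of a float bit pattern that sit 10 positions to the
   left of the low-order end, matching the shift shown above. */
uint32_t float_match13(uint32_t fbits) {
    return (fbits >> 10) & 0x1FFF;
}
```

For the value 12,345, int_low13 applied to 0x00003039 and float_match13 applied to 0x4640E400 both give the 13-bit pattern 1000000111001 marked by the asterisks.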
New to C? Pointers and arrays 

In function show_bytes (Figure 2.4), we see the close connection between pointers and arrays, as will 
be discussed in detail in Section 3.8. We see that this function has an argument start of type byte_pointer 
(which has been defined to be a pointer to unsigned char), but we see the array reference 
start[i] on line 8. In C, we can dereference a pointer with array notation, and we can reference array 
elements with pointer notation. In this example, the reference start[i] indicates that we want to read 
the byte that is i positions beyond the location pointed to by start. 
In lines 13, 17, and 21 of Figure 2.4 we see uses of two operations that give C (and therefore C++) its 
distinctive character. The C "address of" operator '&' creates a pointer. On all three lines, the expression 
&x creates a pointer to the location holding the object indicated by variable x. The type of this pointer 
depends on the type of x, and hence these three pointers are of type int *, float *, and void **, 
respectively. (Data type void * is a special kind of pointer with no associated type information.) 

The cast operator converts from one data type to another. Thus, the cast (byte_pointer) &x 
indicates that whatever type the pointer &x had before, the program will now reference a pointer to 
data of type unsigned char. The casts shown here do not change the actual pointer; they simply direct 
the compiler to refer to the data being pointed to according to the new data type. 
Aside Generating an ASCII table 

You can display a table showing the ASCII character code by executing the command man ascii. 
Consider the following three calls to show_bytes: 

    int val = 0x87654321; 
    byte_pointer valp = (byte_pointer) &val; 
    show_bytes(valp, 1); /* A. */ 
    show_bytes(valp, 2); /* B. */ 
    show_bytes(valp, 3); /* C. */ 

Indicate the values that will be printed by each call on a little-endian machine 
and on a big-endian machine: 

A. Little endian: ________    Big endian: ________ 

B. Little endian: ________    Big endian: ________ 

C. Little endian: ________    Big endian: ________ 
Using show_int and show_float, we determine that the integer 3510593 has hexadecimal 
representation 0x00359141, while the floating-point number 3510593.0 
has hexadecimal representation 0x4A564504. 


A. Write the binary representations of these two hexadecimal values. 


B. Shift these two strings relative to one another to maximize the number of 
matching bits. How many bits match? 
C. What parts of the strings do not match? 




2.1.4 Representing Strings 


A string in C is encoded by an array of characters terminated by the null (having 
value 0) character. Each character is represented by some standard encoding, with 
the most common being the ASCII character code. Thus, if we run our routine 
show_bytes with arguments "12345" and 6 (to include the terminating character), 
we get the result 31 32 33 34 35 00. Observe that the ASCII code for decimal digit 
x happens to be 0x3x, and that the terminating byte has the hex representation 
0x00. This same result would be obtained on any system using ASCII as its 
character code, independent of the byte ordering and word size conventions. As 
a consequence, text data are more platform independent than binary data. 
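
The following sketch shows one way to reproduce this result in a form that can be inspected by a program rather than read from the screen. The function bytes_to_hex is a hypothetical variant of show_bytes that writes into a caller-supplied buffer; the expected output assumes ASCII as the character code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical variant of show_bytes: write the bytes of an object
   into out as text, using the same " %.2x" format. The buffer out
   must have room for 3*len + 1 characters. */
void bytes_to_hex(const void *start, size_t len, char *out) {
    const unsigned char *p = start;
    out[0] = '\0';                      /* result for len == 0 */
    for (size_t i = 0; i < len; i++)
        sprintf(out + 3 * i, " %.2x", (unsigned int) p[i]);
}
```

Calling bytes_to_hex with arguments "12345" and 6 fills the buffer with the text " 31 32 33 34 35 00" on any ASCII system, whatever its byte ordering, because each character occupies a single byte.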





What would be printed as a result of the following call to show_bytes? 


    const char *s = "abcdef"; 
    show_bytes((byte_pointer) s, strlen(s)); 


Note that letters 'a' through 'z' have ASCII codes 0x61 through 0x7A. 





2.1.5 Representing Code 
Consider the following C function: 

    int sum(int x, int y) {
        return x + y;
    }

When compiled on our sample machines, we generate machine code having 
the following byte representations: 

Linux 32    55 89 e5 8b 45 0c 03 45 08 c9 c3 
Windows     55 89 e5 8b 45 0c 03 45 08 5d c3 
Sun         81 c3 e0 08 90 02 00 09 
Linux 64    55 48 89 e5 89 7d fc 89 75 f8 03 45 fc c9 c3 
Aside The Unicode standard for text encoding 

The ASCII character set is suitable for encoding English-language documents, but it does not have 
much in the way of special characters, such as the French 'ç'. It is wholly unsuited for encoding 
documents in languages such as Greek, Russian, and Chinese. Over the years, a variety of methods 
have been developed to encode text for different languages. The Unicode Consortium has devised the 
most comprehensive and widely accepted standard for encoding text. The current Unicode standard 
(version 7.0) has a repertoire of over 100,000 characters supporting a wide range of languages, including 
the ancient languages of Egypt and Babylon. To their credit, the Unicode Technical Committee rejected 
a proposal to include a standard writing for Klingon, a fictional civilization from the television series 
Star Trek. 
The base encoding, known as the "Universal Character Set" of Unicode, uses a 32-bit representation 
of characters. This would seem to require every string of text to consist of 4 bytes per character. 
However, alternative codings are possible where common characters require just 1 or 2 bytes, while 
less common ones require more. In particular, the UTF-8 representation encodes each character as a 
sequence of bytes, such that the standard ASCII characters use the same single-byte encodings as they 
have in ASCII, implying that all ASCII byte sequences have the same meaning in UTF-8 as they do in 
ASCII. 
The Java programming language uses Unicode in its representations of strings. Program libraries 
are also available for C to support Unicode. 
Here we find that the instruction codings are different. Different machine types 
use different and incompatible instructions and encodings. Even identical proces- 
sors running different operating systems have differences in their coding conven- 
tions and hence are not binary compatible. Binary code is seldom portable across 
different combinations of machine and operating system. 

A fundamental concept of computer systems is that a program, from the 
perspective of the machine, is simply a sequence of bytes. The machine has no 
information about the original source program, except perhaps some auxiliary 
tables maintained to aid in debugging. We will see this more clearly when we study 
machine-level programming in Chapter 3. 


2.1.6 Introduction to Boolean Algebra 


Since binary values are at the core of how computers encode, store, and manipu- 
late information, a rich body of mathematical knowledge has evolved around the 
study of the values 0 and 1. This started with the work of George Boole (1815- 
1864) around 1850 and thus is known as Boolean algebra. Boole observed that by 
encoding logic values TRUE and FALSE as binary values 1 and 0, he could formulate 
an algebra that captures the basic principles of logical reasoning. 

The simplest Boolean algebra is defined over the two-element set {0, 1}. 
Figure 2.7 defines several operations in this algebra. Our symbols for representing 
these operations are chosen to match those used by the C bit-level operations, 

    ~            &  0  1        |  0  1        ^  0  1
    0  1         0  0  0        0  0  1        0  0  1
    1  0         1  0  1        1  1  1        1  1  0


Figure 2.7 Operations of Boolean algebra. Binary values 1 and 0 encode logic values 
TRUE and FALSE, while operations ~, & |, and ^ encode logical operations NOT, AND, OR, 
and EXCLUSIVE-OR, respectively. 


as will be discussed later. The Boolean operation ~ corresponds to the logical 
operation NOT, denoted by the symbol ¬. That is, we say that ¬P is true when 
P is not true, and vice versa. Correspondingly, ~p equals 1 when p equals 0, and 
vice versa. Boolean operation & corresponds to the logical operation AND, denoted 
by the symbol ∧. We say that P ∧ Q holds when both P is true and Q is true. 
Correspondingly, p & q equals 1 only when p = 1 and q = 1. Boolean operation 
| corresponds to the logical operation OR, denoted by the symbol ∨. We say that 
P ∨ Q holds when either P is true or Q is true. Correspondingly, p | q equals 
1 when either p = 1 or q = 1. Boolean operation ^ corresponds to the logical 
operation EXCLUSIVE-OR, denoted by the symbol ⊕. We say that P ⊕ Q holds when 
either P is true or Q is true, but not both. Correspondingly, p ^ q equals 1 when 
either p = 1 and q = 0, or p = 0 and q = 1. 

Claude Shannon (1916-2001), who later founded the field of information 
theory, first made the connection between Boolean algebra and digital logic. In 
his 1937 master's thesis, he showed that Boolean algebra could be applied to the 
design and analysis of networks of electromechanical relays. Although computer 
technology has advanced considerably since, Boolean algebra still plays a central 
role in the design and analysis of digital systems. 

We can extend the four Boolean operations to also operate on bit vectors, 
strings of zeros and ones of some fixed length w. We define the operations over bit 
vectors according to their applications to the matching elements of the arguments. 
Let a and b denote the bit vectors $[a_{w-1}, a_{w-2}, \ldots, a_0]$ and $[b_{w-1}, b_{w-2}, \ldots, b_0]$, 
respectively. We define a & b to also be a bit vector of length w, where the ith 
element equals $a_i$ & $b_i$, for $0 \le i < w$. The operations |, ^, and ~ are extended to 
bit vectors in a similar fashion. 

As examples, consider the case where w = 4, and with arguments a = [0110] 
and b = [1100]. Then the four operations a & b, a | b, a ^ b, and ~b yield 


      0110          0110          0110
    & 1100        | 1100        ^ 1100        ~ 1100
      ----          ----          ----          ----
      0100          1110          1010          0011
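
These four results can be reproduced in C. The sketch below stores a bit vector of length w = 4 in the low-order 4 bits of an unsigned int; the names W4_MASK, vec_and, vec_or, vec_xor, and vec_not are assumptions made for this illustration.

```c
/* Bit vectors of length w = 4 kept in the low-order 4 bits of an
   unsigned int. The mask discards the higher bits, which matters
   for ~ because it would otherwise turn them all into ones. */
#define W4_MASK 0xFu

unsigned vec_and(unsigned a, unsigned b) { return (a & b) & W4_MASK; }
unsigned vec_or(unsigned a, unsigned b)  { return (a | b) & W4_MASK; }
unsigned vec_xor(unsigned a, unsigned b) { return (a ^ b) & W4_MASK; }
unsigned vec_not(unsigned a)             { return ~a & W4_MASK; }
```

With a = [0110] written as 0x6 and b = [1100] written as 0xC, these functions return 0x4, 0xE, 0xA, and (for vec_not applied to b) 0x3, matching the four columns above.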





Fill in the following table showing the results of evaluating Boolean operations on 
bit vectors. 


Operation    Result 

a            [01101001] 
b            [01010101] 
~a           ________ 
~b           ________ 
a & b        ________ 
a | b        ________ 
a ^ b        ________ 


Web Aside DATA:BOOL More on Boolean algebra and Boolean rings 

The Boolean operations |, &, and ~ operating on bit vectors of length w form a Boolean algebra, 
for any integer w > 0. The simplest is the case where w = 1 and there are just two elements, but for 
the more general case there are $2^w$ bit vectors of length w. Boolean algebra has many of the same 
properties as arithmetic over integers. For example, just as multiplication distributes over addition, 
written a · (b + c) = (a · b) + (a · c), Boolean operation & distributes over |, written a & (b | c) = (a & b) | 
(a & c). In addition, however, Boolean operation | distributes over &, and so we can write a | (b & c) = 
(a | b) & (a | c), whereas we cannot say that a + (b · c) = (a + b) · (a + c) holds for all integers. 

When we consider operations ^, &, and ~ operating on bit vectors of length w, we get a different 
mathematical form, known as a Boolean ring. Boolean rings have many properties in common with 
integer arithmetic. For example, one property of integer arithmetic is that every value x has an additive 
inverse -x, such that x + -x = 0. A similar property holds for Boolean rings, where ^ is the "addition" 
operation, but in this case each element is its own additive inverse. That is, a ^ a = 0 for any value a, 
where we use 0 here to represent a bit vector of all zeros. We can see this holds for single bits, since 
0 ^ 0 = 1 ^ 1 = 0, and it extends to bit vectors as well. This property holds even when we rearrange terms 
and combine them in a different order, and so (a ^ b) ^ a = b. This property leads to some interesting 
results and clever tricks, as we will explore in Problem 2.10. 


One useful application of bit vectors is to represent finite sets. We can encode 
any subset $A \subseteq \{0, 1, \ldots, w-1\}$ with a bit vector $[a_{w-1}, \ldots, a_1, a_0]$, where $a_i = 1$ if 
and only if $i \in A$. For example, recalling that we write $a_{w-1}$ on the left and $a_0$ on the 
right, bit vector a = [01101001] encodes the set A = {0, 3, 5, 6}, while bit vector b = 
[01010101] encodes the set B = {0, 2, 4, 6}. With this way of encoding sets, Boolean 
operations | and & correspond to set union and intersection, respectively, and ~ 
corresponds to set complement. Continuing our earlier example, the operation 
a & b yields bit vector [01000001], while A ∩ B = {0, 6}. 

We will see the encoding of sets by bit vectors in a number of practical 
applications. For example, in Chapter 8, we will see that there are a number of 
different signals that can interrupt the execution of a program. We can selectively 
enable or disable different signals by specifying a bit-vector mask, where a 1 in 
bit position i indicates that signal i is enabled and a 0 indicates that it is disabled. 
Thus, the mask represents the set of enabled signals. 
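
A minimal sketch of this encoding for subsets of {0, ..., 7} follows. The type name bitset8 and the function names are hypothetical, chosen only for this example; they are not the interface used for signal masks in Chapter 8.

```c
/* A subset of {0, ..., 7} encoded in the low-order 8 bits of an
   unsigned int: bit i is 1 exactly when i is a member. */
typedef unsigned int bitset8;

bitset8 set_union(bitset8 a, bitset8 b)        { return a | b; }
bitset8 set_intersection(bitset8 a, bitset8 b) { return a & b; }

/* The complement is taken relative to {0, ..., 7}, so bits above
   position 7 are cleared. */
bitset8 set_complement(bitset8 a)              { return ~a & 0xFFu; }

/* Return 1 if i (with 0 <= i <= 7) is a member of a, else 0. */
int set_contains(bitset8 a, int i)             { return (int) ((a >> i) & 1u); }
```

With A encoded as 0x69 and B as 0x55, set_intersection returns 0x41, the bit vector [01000001] from the example above, and set_contains reports membership for exactly the elements 0, 3, 5, and 6 of A.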
by mixing three different colors of light: red, green, and blue. Imagine a simple 
scheme, with three different lights, each of which can be turned on or off, projecting 
onto a glass screen: 


[Figure: red, green, and blue light sources projecting onto a glass screen viewed by an observer.] 


We can then create eight different colors based on the absence (0) or presence 
(1) of light sources R, G, and B: 


R  G  B  Color 
0  0  0  Black 
0  0  1  Blue 
0  1  0  Green 
0  1  1  Cyan 
1  0  0  Red 
1  0  1  Magenta 
1  1  0  Yellow 
1  1  1  White 


Each of these colors can be represented as a bit vector of length 3, and we can 
apply Boolean operations to them. 


A. The complement of a color is formed by turning off the lights that are on and 
turning on the lights that are off. What would be the complement of each of 
the eight colors listed above? 


B. Describe the effect of applying Boolean operations on the following colors: 


Blue | Green   = ________ 
Yellow & Cyan  = ________ 
Red ^ Magenta  = ________ 
2.1.7 Bit-Level Operations in C 

One useful feature of C is that it supports bitwise Boolean operations. In fact, the 
symbols we have used for the Boolean operations are exactly those used by C: 
| for OR, & for AND, ~ for NOT, and ^ for EXCLUSIVE-OR. These can be applied to 
any "integral" data type, including all of those listed in Figure 2.3. Here are some 
examples of expression evaluation for data type char: 


C expression    Binary expression            Binary result    Hexadecimal result 
~0x41           ~[0100 0001]                 [1011 1110]      0xBE 
~0x00           ~[0000 0000]                 [1111 1111]      0xFF 
0x69 & 0x55     [0110 1001] & [0101 0101]    [0100 0001]      0x41 
0x69 | 0x55     [0110 1001] | [0101 0101]    [0111 1101]      0x7D 

As our examples show, the best way to determine the effect of a bit-level ex- 
pression is to expand the hexadecimal arguments to their binary representations, 
perform the operations in binary, and then convert back to hexadecimal. 
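
One detail is worth seeing in code when reproducing the table for 8-bit values. In C, operands narrower than int are promoted to int before a bit-level operation, so the expression ~0x41 evaluated as an int has ones in all of its high-order bits. The sketch below, with hypothetical function names and assuming 8-bit bytes, converts each result back to unsigned char to keep only the low-order byte.

```c
/* Bit-level operations on 8-bit values. Each operation is carried
   out on promoted int operands; the cast back to unsigned char
   keeps just the low-order 8 bits of the result. */
unsigned char byte_not(unsigned char x) {
    return (unsigned char) ~x;
}

unsigned char byte_and(unsigned char x, unsigned char y) {
    return (unsigned char) (x & y);
}

unsigned char byte_or(unsigned char x, unsigned char y) {
    return (unsigned char) (x | y);
}
```

Under these assumptions, byte_not(0x41) returns 0xBE, byte_not(0x00) returns 0xFF, byte_and(0x69, 0x55) returns 0x41, and byte_or(0x69, 0x55) returns 0x7D, in agreement with the hexadecimal results listed in the table.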











As an application of the property that a ^ a = 0 for any bit vector a, consider the 
following program: 

    void inplace_swap(int *x, int *y) {
        *y = *x ^ *y;  /* Step 1 */
        *x = *x ^ *y;  /* Step 2 */
        *y = *x ^ *y;  /* Step 3 */
    }

As the name implies, we claim that the effect of this procedure is to swap 
the values stored at the locations denoted by pointer variables x and y. Note 
that unlike the usual technique for swapping two values, we do not need a third 
location to temporarily store one value while we are moving the other. There is 
no performance advantage to this way of swapping; it is merely an intellectual 
amusement. 

Starting with values a and b in the locations pointed to by x and y, respectively, 
fill in the table that follows, giving the values stored at the two locations after each 
step of the procedure. Use the properties of ^ to show that the desired effect is 
achieved. Recall that every element is its own additive inverse (that is, a ^ a = 0). 





Step         *x          *y 
Initially    a           b 
Step 1       ________    ________ 
Step 2       ________    ________ 
Step 3       ________    ________ 
Armed with the function inplace_swap from Problem 2.10, you decide to write 
code that will reverse the elements of an array by swapping elements from opposite 
ends of the array, working toward the middle. 

You arrive at the following function: 


    void reverse_array(int a[], int cnt) {
        int first, last;
        for (first = 0, last = cnt-1;
             first <= last;
             first++,last--)
            inplace_swap(&a[first], &a[last]);
    }


When you apply your function to an array containing elements 1, 2, 3, and 4, 
you find the array now has, as expected, elements 4, 3, 2, and 1. When you try it 
on an array with elements 1, 2, 3, 4, and 5, however, you are surprised to see that 
the array now has elements 5, 4, 0, 2, and 1. In fact, you discover that the code 
always works correctly on arrays of even length, but it sets the middle element to 
0 whenever the array has odd length. 


A. For an array of odd length cnt = 2k + 1, what are the values of variables 
first and last in the final iteration of function reverse_array? 


B. Why does this call to function inplace_swap set the array element to 0? 


C. What simple modification to the code for reverse_array would eliminate 
this problem? 


One common use of bit-level operations is to implement masking operations, 
where a mask is a bit pattern that indicates a selected set of bits within a word. As 
an example, the mask 0xFF (having ones for the least significant 8 bits) indicates 
the low-order byte of a word. The bit-level operation x & 0xFF yields a value 
consisting of the least significant byte of x, but with all other bytes set to 0. For 
example, with x = 0x89ABCDEF, the expression would yield 0x000000EF. The 
expression ~0 will yield a mask of all ones, regardless of the size of the data 
representation. The same mask can be written 0xFFFFFFFF when data type int is 
32 bits, but it would not be as portable. 
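
The masking idea extends naturally to selecting any byte of a word by first shifting that byte into the low-order position. The sketch below uses hypothetical function names and unsigned arithmetic; shift operations are described in Section 2.1.9.

```c
/* Hypothetical helper: return byte i of x, where byte 0 is the least
   significant. The byte is shifted into the low-order position and
   then selected with the mask 0xFF. Assumes 0 <= i < sizeof(unsigned). */
unsigned get_byte(unsigned x, int i) {
    return (x >> (8 * i)) & 0xFF;
}

/* A mask of all ones for type unsigned, whatever its width. */
unsigned all_ones(void) {
    return ~0u;
}
```

With x = 0x89ABCDEF, get_byte(x, 0) returns 0xEF, the same value as x & 0xFF in the text, and on a machine with a 32-bit unsigned type get_byte(x, 3) returns 0x89.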





Write C expressions, in terms of variable x, for the following values. Your code 
should work for any word size w ≥ 8. For reference, we show the result of evaluating 
the expressions for x = 0x87654321, with w = 32. 


A. The least significant byte of x, with all other bits set to 0. [0x00000021] 


B. All but the least significant byte of x complemented, with the least significant 
byte left unchanged. [0x789ABC21] 

C. The least significant byte set to all ones, and all other bytes of x left unchanged. 
[0x876543FF] 





The Digital Equipment VAX computer was a very popular machine from the late 
1970s until the late 1980s. Rather than instructions for Boolean operations AND 
and OR, it had instructions bis (bit set) and bic (bit clear). Both instructions take 
a data word x and a mask word m. They generate a result z consisting of the bits of 
x modified according to the bits of m. With bis, the modification involves setting 
z to 1 at each bit position where m is 1. With bic, the modification involves setting 
z to 0 at each bit position where m is 1. 

To see how these operations relate to the C bit-level operations, assume we 
have functions bis and bic implementing the bit set and bit clear operations, and 
that we want to use these to implement functions computing bitwise operations | 
and ^, without using any other C operations. Fill in the missing code below. Hint: 
Write C expressions for the operations bis and bic. 


    /* Declarations of functions implementing operations bis and bic */
    int bis(int x, int m);
    int bic(int x, int m);

    /* Compute x|y using only calls to functions bis and bic */
    int bool_or(int x, int y) {
        int result = ________;
        return result;
    }

    /* Compute x^y using only calls to functions bis and bic */
    int bool_xor(int x, int y) {
        int result = ________;
        return result;
    }


2.1.8 Logical Operations in C 


C also provides a set of logical operators ||, &&, and !, which correspond to the 
OR, AND, and NOT operations of logic. These can easily be confused with the bit-level 
operations, but their behavior is quite different. The logical operations treat 
any nonzero argument as representing TRUE and argument 0 as representing FALSE. 
They return either 1 or 0, indicating a result of either TRUE or FALSE, respectively. 
Here are some examples of expression evaluation: 


Expression      Result 
!0x41           0x00 
!0x00           0x01 
!!0x41          0x01 
0x69 && 0x55    0x01 
0x69 || 0x55    0x01 


Observe that a bitwise operation will have behavior matching that of its logical 
counterpart only in the special case in which the arguments are restricted to 0 
or 1. 

A second important distinction between the logical operators '&&' and '||' 
versus their bit-level counterparts '&' and '|' is that the logical operators do not 
evaluate their second argument if the result of the expression can be determined 
by evaluating the first argument. Thus, for example, the expression a && 5/a will 
never cause a division by zero, and the expression p && *p++ will never cause the 
dereferencing of a null pointer. 
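
The short-circuit behavior can be seen in a pair of small functions. This is only a sketch; the names safe_div_test and points_to_nonzero are hypothetical, and the second function uses *p rather than *p++ to keep the example free of side effects.

```c
/* Return 1 if a is nonzero and 5/a is nonzero, else 0. When a is 0,
   && stops after its first operand, so 5/a is never evaluated. */
int safe_div_test(int a) {
    return a && 5 / a;
}

/* Return 1 if p is non-null and points to a nonzero int, else 0.
   A null p is never dereferenced. */
int points_to_nonzero(const int *p) {
    return p && *p;
}
```

Calling safe_div_test(0) returns 0 without dividing. Replacing && with the bit-level & in either function would remove this protection, since both operands of & are always evaluated.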





Suppose that x and y have byte values 0x66 and 0x39, respectively. Fill in the 
following table indicating the byte values of the different C expressions: 


Expression    Value       Expression    Value 
x & y         ________    x && y        ________ 
x | y         ________    x || y        ________ 
~x | ~y       ________    !x || !y      ________ 
x & !y        ________    x && ~y       ________ 





Using only bit-level and logical operations, write a C expression that is equivalent
to x == y. In other words, it will return 1 when x and y are equal and 0 otherwise.





2.1.9 Shift Operations


C also provides a set of shift operations for shifting bit patterns to the left and to
the right. For an operand x having bit representation $[x_{w-1}, x_{w-2}, \ldots, x_0]$, the C
expression x << k yields a value with bit representation
$[x_{w-k-1}, x_{w-k-2}, \ldots, x_0, 0, \ldots, 0]$. That is, x is shifted k bits to the left,
dropping off the k most significant bits and filling the right end with k zeros. The
shift amount should be a value between 0 and w - 1. Shift operations associate from
left to right, so x << j << k is equivalent to (x << j) << k.

There is a corresponding right shift operation, written in C as x >> k, but it has
a slightly subtle behavior. Generally, machines support two forms of right shift:

Logical. A logical right shift fills the left end with k zeros, giving a result
$[0, \ldots, 0, x_{w-1}, x_{w-2}, \ldots, x_k]$.

Arithmetic. An arithmetic right shift fills the left end with k repetitions of the
most significant bit, giving a result
$[x_{w-1}, \ldots, x_{w-1}, x_{w-1}, x_{w-2}, \ldots, x_k]$.
This convention might seem peculiar, but as we will see, it is useful for
operating on signed integer data.


As examples, the following table shows the effect of applying the different
shift operations to two different values of an 8-bit argument x:

Operation               Value 1       Value 2
Argument x              [01100011]    [10010101]
x << 4                  [00110000]    [01010000]
x >> 4 (logical)        [00000110]    [00001001]
x >> 4 (arithmetic)     [00000110]    [11111001]


In each result, four digits are fill values: the rightmost four for the left shift
and the leftmost four for the right shifts. Observe that all but one entry involves
filling with zeros. The exception is the case of shifting [10010101] right
arithmetically. Since its most significant bit is 1, this will be used as the fill value.
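
The table entries can be reproduced with a short sketch. The function names below are hypothetical. The arithmetic shift is built from unsigned operations on purpose, so the result does not depend on how a particular compiler shifts signed values. Shift amounts are assumed to lie between 0 and 7.

```c
#include <stdint.h>

/* Left shift of an 8-bit value; the k most significant bits drop off. */
uint8_t shl8(uint8_t x, unsigned k) {
    return (uint8_t)(x << k);
}

/* Logical right shift: unsigned operands are always filled with zeros. */
uint8_t shr8_logical(uint8_t x, unsigned k) {
    return (uint8_t)(x >> k);
}

/* Arithmetic right shift emulated with unsigned operations only. */
uint8_t shr8_arith(uint8_t x, unsigned k) {
    uint8_t shifted = (uint8_t)(x >> k);
    /* If the sign bit is set, turn the k vacated high bits into ones. */
    uint8_t fill = (x & 0x80) ? (uint8_t)~(0xFFu >> k) : 0;
    return (uint8_t)(shifted | fill);
}
```

For example, shr8_arith(0x95, 4) yields 0xF9, the pattern [11111001] from the last row of the table.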

The C standards do not precisely define which type of right shift should be
used with signed numbers: either arithmetic or logical shifts may be used. This
unfortunately means that any code assuming one form or the other will potentially
encounter portability problems. In practice, however, almost all compiler/machine
combinations use arithmetic right shifts for signed data, and many programmers
assume this to be the case. For unsigned data, on the other hand, right shifts must
be logical.

In contrast to C, Java has a precise definition of how right shifts should be
performed. The expression x >> k shifts x arithmetically by k positions, while
x >>> k shifts it logically.





Fill in the table below showing the effects of the different shift operations on
single-byte quantities. The best way to think about shift operations is to work with
binary representations. Convert the initial values to binary, perform the shifts, and
then convert back to hexadecimal. Each of the answers should be 8 binary digits or 2
hexadecimal digits.

x        x << 3        x >> 2 (logical)        x >> 2 (arithmetic)





Aside Shifting by k, for large values of k

For a data type consisting of w bits, what should be the effect of shifting by some
value k ≥ w? For example, what should be the effect of computing the following
expressions, assuming data type int has w = 32:

    int      lval = 0xFEDCBA98  << 32;
    int      aval = 0xFEDCBA98  >> 36;
    unsigned uval = 0xFEDCBA98u >> 40;

The C standards carefully avoid stating what should be done in such a case. On many
machines, the shift instructions consider only the lower $\log_2 w$ bits of the shift
amount when shifting a w-bit value, and so the shift amount is computed as k mod w.
For example, with w = 32, the above three shifts would be computed as if they were by
amounts 0, 4, and 8, respectively, giving results

    lval    0xFEDCBA98
    aval    0xFFEDCBA9
    uval    0x00FEDCBA

This behavior is not guaranteed for C programs, however, and so shift amounts should
be kept less than the word size.

Java, on the other hand, specifically requires that shift amounts should be computed
in the modular fashion we have shown.
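
Code that needs the modular behavior can request it explicitly instead of relying on the machine. The following is a sketch under the assumption that w = 32. The function names are hypothetical, and the arithmetic variant is emulated with unsigned operations so that no step depends on unspecified behavior.

```c
#include <stdint.h>

/* Reduce the shift amount mod 32 by keeping its low 5 bits (log2 32 = 5). */
uint32_t shl32_mod(uint32_t x, unsigned k) {
    return x << (k & 31u);
}

uint32_t shr32_logical_mod(uint32_t x, unsigned k) {
    return x >> (k & 31u);
}

/* Arithmetic right shift by k mod 32, using unsigned operations only. */
uint32_t shr32_arith_mod(uint32_t x, unsigned k) {
    unsigned s = k & 31u;
    uint32_t shifted = x >> s;
    /* Replicate the sign bit into the s vacated high positions. */
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> s) : 0u;
    return shifted | fill;
}
```

Applied to 0xFEDCBA98 with shift amounts 32, 36, and 40, these functions give the three results listed above.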


Aside Operator precedence issues with shift operations

It might be tempting to write the expression 1<<2 + 3<<4, intending it to mean
(1<<2) + (3<<4). However, in C the former expression is equivalent to
1 << (2+3) << 4, since addition (and subtraction) have higher precedence than shifts.
The left-to-right associativity rule then causes this to be parenthesized as
(1 << (2+3)) << 4, giving value 512, rather than the intended 52.

Getting the precedence wrong in C expressions is a common source of program errors,
and often these are difficult to spot by inspection. When in doubt, put in parentheses!
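
The two readings can be compared directly. This is a small sketch with hypothetical function names. Many compilers warn about the first function, which is exactly the hazard being illustrated.

```c
/* Parsed by C as (1 << (2 + 3)) << 4, because + binds tighter than <<. */
int shift_without_parens(void) {
    return 1 << 2 + 3 << 4;
}

/* The intended computation, with explicit parentheses. */
int shift_with_parens(void) {
    return (1 << 2) + (3 << 4);
}
```

The first function returns 512 and the second returns 52.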





2.2 Integer Representations 


In this section, we describe two different ways bits can be used to encode integers:
one that can only represent nonnegative numbers, and one that can represent
negative, zero, and positive numbers. We will see later that they are strongly
related both in their mathematical properties and their machine-level
implementations. We also investigate the effect of expanding or shrinking an encoded
integer to fit a representation with a different length.

Figure 2.8 lists the mathematical terminology we introduce to precisely define
and characterize how computers encode and operate on integer data. This


Symbol      Type         Meaning                            Page
$B2T_w$     Function     Binary to two's complement         64
$B2U_w$     Function     Binary to unsigned                 62
$U2B_w$     Function     Unsigned to binary                 64
$U2T_w$     Function     Unsigned to two's complement       71
$T2B_w$     Function     Two's complement to binary         65
$T2U_w$     Function     Two's complement to unsigned       71
$TMin_w$    Constant     Minimum two's-complement value     65
$TMax_w$    Constant     Maximum two's-complement value     65
$UMax_w$    Constant     Maximum unsigned value             63
$+^t_w$     Operation    Two's-complement addition          90
$+^u_w$     Operation    Unsigned addition                  85
$*^t_w$     Operation    Two's-complement multiplication    97
$*^u_w$     Operation    Unsigned multiplication            96
$-^t_w$     Operation    Two's-complement negation          95
$-^u_w$     Operation    Unsigned negation                  89


Figure 2.8 Terminology for integer data and arithmetic operations. The subscript 
w denotes the number of bits in the data representation. The "Page" column indicates 
the page on which the term is defined. 


terminology will be introduced over the course of the presentation. The figure is 
included here as a reference. 


2.2.1 Integral Data Types 


C supports a variety of integral data types, ones that represent finite ranges of
integers. These are shown in Figures 2.9 and 2.10, along with the ranges of values
they can have for "typical" 32- and 64-bit programs. Each type can specify a
size with keyword char, short, long, as well as an indication of whether the
represented numbers are all nonnegative (declared as unsigned), or possibly
negative (the default). As we saw in Figure 2.3, the number of bytes allocated for
the different sizes varies according to whether the program is compiled for 32 or
64 bits. Based on the byte allocations, the different sizes allow different ranges of
values to be represented. The only machine-dependent range indicated is for size
designator long. Most 64-bit programs use an 8-byte representation, giving a much
wider range of values than the 4-byte representation used with 32-bit programs.

One important feature to note in Figures 2.9 and 2.10 is that the ranges are not
symmetric: the range of negative numbers extends one further than the range of
positive numbers. We will see why this happens when we consider how negative
numbers are represented.


C data type         Minimum                         Maximum
[signed] char       -128                            127
unsigned char       0                               255
short               -32,768                         32,767
unsigned short      0                               65,535
int                 -2,147,483,648                  2,147,483,647
unsigned            0                               4,294,967,295
long                -2,147,483,648                  2,147,483,647
unsigned long       0                               4,294,967,295
int32_t             -2,147,483,648                  2,147,483,647
uint32_t            0                               4,294,967,295
int64_t             -9,223,372,036,854,775,808      9,223,372,036,854,775,807
uint64_t            0                               18,446,744,073,709,551,615


Figure 2.9 Typical ranges for C integral data types for 32-bit programs. 


C data type         Minimum                         Maximum
[signed] char       -128                            127
unsigned char       0                               255
short               -32,768                         32,767
unsigned short      0                               65,535
int                 -2,147,483,648                  2,147,483,647
unsigned            0                               4,294,967,295
long                -9,223,372,036,854,775,808      9,223,372,036,854,775,807
unsigned long       0                               18,446,744,073,709,551,615
int32_t             -2,147,483,648                  2,147,483,647
uint32_t            0                               4,294,967,295
int64_t             -9,223,372,036,854,775,808      9,223,372,036,854,775,807
uint64_t            0                               18,446,744,073,709,551,615


Figure 2.10 Typical ranges for C integral data types for 64-bit programs. 




The C standards define minimum ranges of values that each data type must
be able to represent. As shown in Figure 2.11, their ranges are the same or smaller
than the typical implementations shown in Figures 2.9 and 2.10. In particular,
with the exception of the fixed-size data types, we see that they require only a
New to C? Signed and unsigned numbers in C, C++, and Java

Both C and C++ support signed (the default) and unsigned numbers. Java supports only
signed numbers.

C data type         Minimum                         Maximum
[signed] char       -127                            127
unsigned char       0                               255
short               -32,767                         32,767
unsigned short      0                               65,535
int                 -32,767                         32,767
unsigned            0                               65,535
long                -2,147,483,647                  2,147,483,647
unsigned long       0                               4,294,967,295
int32_t             -2,147,483,648                  2,147,483,647
uint32_t            0                               4,294,967,295
int64_t             -9,223,372,036,854,775,808      9,223,372,036,854,775,807
uint64_t            0                               18,446,744,073,709,551,615


Figure 2.11 Guaranteed ranges for C integral data types. The C standards require 
that the data types have at least these ranges of values. 




symmetric range of positive and negative numbers. We also see that data type int
could be implemented with 2-byte numbers, although this is mostly a throwback
to the days of 16-bit machines. We also see that size long can be implemented
with 4-byte numbers, and it typically is for 32-bit programs. The fixed-size data
types guarantee that the ranges of values will be exactly those given by the typical
numbers of Figure 2.9, including the asymmetry between negative and positive.


2.2.2 Unsigned Encodings

Let us consider an integer data type of w bits. We write a bit vector as either
$\vec{x}$, to denote the entire vector, or as $[x_{w-1}, x_{w-2}, \ldots, x_0]$ to denote the
individual bits within the vector. Treating $\vec{x}$ as a number written in binary
notation, we obtain the unsigned interpretation of $\vec{x}$. In this encoding, each bit
$x_i$ has value 0 or 1, with the latter case indicating that value $2^i$ should be
included as part of the numeric value. We can express this interpretation as a
function $B2U_w$ (for "binary to unsigned," length w):


Figure 2.12 Unsigned number examples for w = 4. When bit i in the binary
representation has value 1, it contributes $2^i$ to the value.





PRINCIPLE: Definition of unsigned encoding
For vector $\vec{x} = [x_{w-1}, x_{w-2}, \ldots, x_0]$:

$$B2U_w(\vec{x}) \doteq \sum_{i=0}^{w-1} x_i 2^i \tag{2.1}$$


In this equation, the notation $\doteq$ means that the left-hand side is defined to be
equal to the right-hand side. The function $B2U_w$ maps strings of zeros and ones
of length w to nonnegative integers. As examples, Figure 2.12 shows the mapping,
given by $B2U_4$, from bit vectors to integers for the following cases:


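$$
\begin{aligned}
B2U_4([0001]) &= 0\cdot 2^3 + 0\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0 &&= 0+0+0+1 &&= 1\\
B2U_4([0101]) &= 0\cdot 2^3 + 1\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0 &&= 0+4+0+1 &&= 5\\
B2U_4([1011]) &= 1\cdot 2^3 + 0\cdot 2^2 + 1\cdot 2^1 + 1\cdot 2^0 &&= 8+0+2+1 &&= 11\\
B2U_4([1111]) &= 1\cdot 2^3 + 1\cdot 2^2 + 1\cdot 2^1 + 1\cdot 2^0 &&= 8+4+2+1 &&= 15
\end{aligned}
\tag{2.2}
$$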


In the figure, we represent each bit position i by a rightward-pointing blue bar of
length $2^i$. The numeric value associated with a bit vector then equals the sum of
the lengths of the bars for which the corresponding bit values are 1.
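
Equation 2.1 translates almost directly into code. The following is a minimal sketch, assuming the bit vector is stored as an array of ints with the most significant bit first and that w is at most 64. The function name b2u and this array layout are illustrative choices, not part of the text.

```c
#include <stddef.h>

/* Computes B2U_w for a bit vector stored most significant bit first:
   bits[0] holds x_{w-1} and bits[w-1] holds x_0. Assumes w <= 64. */
unsigned long long b2u(const int bits[], size_t w) {
    unsigned long long value = 0;
    for (size_t i = 0; i < w; i++) {
        /* Doubling the running value and adding the next bit evaluates
           the sum of x_i * 2^i from the most significant end. */
        value = 2 * value + (unsigned long long)(bits[i] & 1);
    }
    return value;
}
```

Applied to the four vectors of Equation 2.2, b2u returns 1, 5, 11, and 15.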

Let us consider the range of values that can be represented using w bits. The
least value is given by bit vector $[00 \cdots 0]$ having integer value 0, and the greatest
value is given by bit vector $[11 \cdots 1]$ having integer value
$UMax_w \doteq \sum_{i=0}^{w-1} 2^i = 2^w - 1$. Using the 4-bit case as an example, we have
$UMax_4 = B2U_4([1111]) = 2^4 - 1 = 15$. Thus, the function $B2U_w$ can be defined as a
mapping $B2U_w\colon \{0, 1\}^w \rightarrow \{0, \ldots, UMax_w\}$.

The unsigned binary representation has the important property that every
number between 0 and $2^w - 1$ has a unique encoding as a w-bit value. For example,
there is only one representation of decimal value 11 as an unsigned 4-bit number,
namely, [1011]. We highlight this as a mathematical principle, which we first state
and then explain.


PRINCIPLE: Uniqueness of unsigned encoding
Function $B2U_w$ is a bijection.


The mathematical term bijection refers to a function f that goes two ways:
it maps a value x to a value y where y = f(x), but it can also operate in reverse,
since for every y, there is a unique value x such that f(x) = y. This is given by
the inverse function $f^{-1}$ where, for our example, $x = f^{-1}(y)$. The function $B2U_w$
maps each bit vector of length w to a unique number between 0 and $2^w - 1$, and
it has an inverse, which we call $U2B_w$ (for "unsigned to binary"), that maps each
number in the range 0 to $2^w - 1$ to a unique pattern of w bits.
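
The bijection can be checked exhaustively for small word sizes. The following is a sketch with hypothetical function names (u2b, b2u_bits, roundtrip_holds). It assumes bit vectors are stored most significant bit first and that w is less than 64.

```c
#include <stddef.h>

/* U2B_w: writes the w-bit unsigned encoding of x into bits[], most
   significant bit first. Assumes 0 <= x <= 2^w - 1. */
void u2b(unsigned long long x, int bits[], size_t w) {
    for (size_t i = 0; i < w; i++) {
        bits[w - 1 - i] = (int)((x >> i) & 1u);
    }
}

/* B2U_w, included so that this sketch is self-contained. */
unsigned long long b2u_bits(const int bits[], size_t w) {
    unsigned long long value = 0;
    for (size_t i = 0; i < w; i++) {
        value = 2 * value + (unsigned long long)(bits[i] & 1);
    }
    return value;
}

/* Returns 1 if B2U_w(U2B_w(x)) == x for every x in 0 .. 2^w - 1. */
int roundtrip_holds(size_t w) {
    int bits[64];
    unsigned long long limit = 1ULL << w;   /* 2^w; requires w < 64 */
    for (unsigned long long x = 0; x < limit; x++) {
        u2b(x, bits, w);
        if (b2u_bits(bits, w) != x) {
            return 0;
        }
    }
    return 1;
}
```

For instance, u2b(11, bits, 4) fills bits with 1, 0, 1, 1, the unique 4-bit encoding of 11.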


2.2.3 Two's-Complement Encodings 


For many applications, we wish to represent negative values as well. The most
common computer representation of signed numbers is known as two's-complement
form. This is defined by interpreting the most significant bit of the word to have
negative weight. We express this interpretation as a function $B2T_w$ (for "binary
to two's complement," length w):


PRINCIPLE: Definition of two's-complement encoding
For vector $\vec{x} = [x_{w-1}, x_{w-2}, \ldots, x_0]$:

$$B2T_w(\vec{x}) \doteq -x_{w-1} 2^{w-1} + \sum_{i=0}^{w-2} x_i 2^i \tag{2.3}$$


The most significant bit $x_{w-1}$ is also called the sign bit. Its "weight" is $-2^{w-1}$,
the negation of its weight in an unsigned representation. When the sign bit is set
to 1, the represented value is negative, and when set to 0, the value is nonnegative.
As examples, Figure 2.13 shows the mapping, given by $B2T_4$, from bit vectors to
integers for the following cases:


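$$
\begin{aligned}
B2T_4([0001]) &= -0\cdot 2^3 + 0\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0 &&= 0+0+0+1 &&= 1\\
B2T_4([0101]) &= -0\cdot 2^3 + 1\cdot 2^2 + 0\cdot 2^1 + 1\cdot 2^0 &&= 0+4+0+1 &&= 5\\
B2T_4([1011]) &= -1\cdot 2^3 + 0\cdot 2^2 + 1\cdot 2^1 + 1\cdot 2^0 &&= -8+0+2+1 &&= -5\\
B2T_4([1111]) &= -1\cdot 2^3 + 1\cdot 2^2 + 1\cdot 2^1 + 1\cdot 2^0 &&= -8+4+2+1 &&= -1
\end{aligned}
\tag{2.4}
$$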


In the figure, we indicate that the sign bit has negative weight by showing it as
a leftward-pointing gray bar. The numeric value associated with a bit vector is
then given by the combination of the possible leftward-pointing gray bar and the
rightward-pointing blue bars.

Figure 2.13 Two's-complement number examples for w = 4. Bit 3 serves as a
sign bit; when set to 1, it contributes $-2^3 = -8$ to the value. This weighting
is shown as a leftward-pointing gray bar.


We see that the bit patterns are identical for Figures 2.12 and 2.13 (as well as 
for Equations 2.2 and 2.4), but the values differ when the most significant bit is 1, 
since in one case it has weight +8, and in the other case it has weight —8. 
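
Equation 2.3 can be sketched in the same style as the unsigned case. The function name b2t and the array layout (most significant bit first) are assumptions made for illustration. The word size is restricted to 1 ≤ w ≤ 63 so that every intermediate value fits in a long long.

```c
#include <stddef.h>

/* Computes B2T_w for a bit vector stored most significant bit first:
   bits[0] is the sign bit x_{w-1}. Assumes 1 <= w <= 63. */
long long b2t(const int bits[], size_t w) {
    long long value = 0;
    /* Bits x_{w-2} .. x_0 carry positive weight, as in B2U. */
    for (size_t i = 1; i < w; i++) {
        value = 2 * value + (bits[i] & 1);
    }
    /* The sign bit carries weight -2^(w-1). */
    if (bits[0] & 1) {
        value -= 1LL << (w - 1);
    }
    return value;
}
```

On the vectors of Equation 2.4, b2t returns 1, 5, -5, and -1.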

Let us consider the range of values that can be represented as a w-bit
two's-complement number. The least representable value is given by bit vector
$[10 \cdots 0]$ (set the bit with negative weight but clear all others), having integer
value $TMin_w \doteq -2^{w-1}$. The greatest value is given by bit vector $[01 \cdots 1]$ (clear
the bit with negative weight but set all others), having integer value
$TMax_w \doteq \sum_{i=0}^{w-2} 2^i = 2^{w-1} - 1$. Using the 4-bit case as an example, we have
$TMin_4 = B2T_4([1000]) = -2^3 = -8$ and
$TMax_4 = B2T_4([0111]) = 2^2 + 2^1 + 2^0 = 4 + 2 + 1 = 7$.

We can see that $B2T_w$ is a mapping of bit patterns of length w to numbers
between $TMin_w$ and $TMax_w$, written as
$B2T_w\colon \{0, 1\}^w \rightarrow \{TMin_w, \ldots, TMax_w\}$. As
we saw with the unsigned representation, every number within the representable
range has a unique encoding as a w-bit two's-complement number. This leads to
a principle for two's-complement numbers similar to that for unsigned numbers:


PRINCIPLE: Uniqueness of two's-complement encoding
Function $B2T_w$ is a bijection.


We define function $T2B_w$ (for "two's complement to binary") to be the inverse
of $B2T_w$. That is, for a number x, such that $TMin_w \le x \le TMax_w$, $T2B_w(x)$ is the
(unique) w-bit pattern that encodes x.





Assuming w = 4, we can assign a numeric value to each possible hexadecimal
digit, assuming either an unsigned or a two's-complement interpretation. Fill in
the following table according to these interpretations by writing out the nonzero
powers of 2 in the summations shown in Equations 2.1 and 2.3:

Hexadecimal    Binary    $B2U_4(\vec{x})$               $B2T_4(\vec{x})$
0xE            [1110]    $2^3 + 2^2 + 2^1 = 14$         $-2^3 + 2^2 + 2^1 = -2$
0x0            ______    ______                         ______
0x5            ______    ______                         ______
0x8            ______    ______                         ______
0xD            ______    ______                         ______
0xF            ______    ______                         ______

Figure 2.14 shows the bit patterns and numeric values for several important
numbers for different word sizes. The first three give the ranges of representable
integers in terms of the values of $UMax_w$, $TMin_w$, and $TMax_w$. We will refer
to these three special values often in the ensuing discussion. We will drop the
subscript w and refer to the values UMax, TMin, and TMax when w can be inferred
from context or is not central to the discussion.

A few points are worth highlighting about these numbers. First, as observed
in Figures 2.9 and 2.10, the two's-complement range is asymmetric:
|TMin| = |TMax| + 1; that is, there is no positive counterpart to TMin. As we shall
see, this leads to some peculiar properties of two's-complement arithmetic and can be
the source of subtle program bugs. This asymmetry arises because half the bit patterns
(those with the sign bit set to 1) represent negative numbers, while half (those
with the sign bit set to 0) represent nonnegative numbers. Since 0 is nonnegative,
this means that it can represent one less positive number than negative. Second,
the maximum unsigned value is just over twice the maximum two's-complement
value: UMax = 2TMax + 1. All of the bit patterns that denote negative numbers in
two's-complement notation become positive values in an unsigned representation.








                                     Word size w
Value       8       16        32                64
$UMax_w$    0xFF    0xFFFF    0xFFFFFFFF        0xFFFFFFFFFFFFFFFF
            255     65,535    4,294,967,295     18,446,744,073,709,551,615
$TMin_w$    0x80    0x8000    0x80000000        0x8000000000000000
            -128    -32,768   -2,147,483,648    -9,223,372,036,854,775,808
$TMax_w$    0x7F    0x7FFF    0x7FFFFFFF        0x7FFFFFFFFFFFFFFF
            127     32,767    2,147,483,647     9,223,372,036,854,775,807
-1          0xFF    0xFFFF    0xFFFFFFFF        0xFFFFFFFFFFFFFFFF
0           0x00    0x0000    0x00000000        0x0000000000000000


Figure 2.14 Important numbers. Both numeric values and hexadecimal
representations are shown.
Aside More on fixed-size integer types

For some programs, it is essential that data types be encoded using representations
with specific sizes. For example, when writing programs to enable a machine to
communicate over the Internet according to a standard protocol, it is important to
have data types compatible with those specified by the protocol. We have seen that
some C data types, especially long, have different ranges on different machines,
and in fact the C standards only specify the minimum ranges for any data type, not
the exact ranges. Although we can choose data types that will be compatible with
standard representations on most machines, there is no guarantee of portability.

We have already encountered the 32- and 64-bit versions of fixed-size integer types
(Figure 2.3); they are part of a larger class of data types. The ISO C99 standard
introduces this class of integer types in the file stdint.h. This file defines a set
of data types with declarations of the form intN_t and uintN_t, specifying N-bit
signed and unsigned integers, for different values of N. The exact values of N are
implementation dependent, but most compilers allow values of 8, 16, 32, and 64. Thus,
we can unambiguously declare an unsigned 16-bit variable by giving it type uint16_t,
and a signed variable of 32 bits as int32_t.

Along with these data types are a set of macros defining the minimum and maximum
values for each value of N. These have names of the form INTN_MIN, INTN_MAX, and
UINTN_MAX.

Formatted printing with fixed-width types requires use of macros that expand into
format strings in a system-dependent manner. So, for example, the values of variables
x and y of type int32_t and uint64_t can be printed by the following call to printf:


    printf("x = %" PRId32 ", y = %" PRIu64 "\n", x, y);


When compiled as a 64-bit program, macro PRId32 expands to the string "d", while
PRIu64 expands to the pair of strings "l" "u". When the C preprocessor encounters a
sequence of string constants separated only by spaces (or other whitespace
characters), it concatenates them together. Thus, the above call to printf becomes

    printf("x = %d, y = %lu\n", x, y);

Using the macros ensures that a correct format string will be generated regardless of
how the code is compiled.
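
A self-contained variant of the call above might look as follows. This is a sketch: the wrapper function format_xy is hypothetical and formats into a caller-supplied buffer rather than printing, so that the result can be inspected. The PRId32 and PRIu64 macros themselves come from the standard header <inttypes.h>.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Formats x and y into buf using the <inttypes.h> conversion macros.
   Returns the number of characters snprintf would have written. */
int format_xy(char *buf, size_t n, int32_t x, uint64_t y) {
    return snprintf(buf, n, "x = %" PRId32 ", y = %" PRIu64 "\n", x, y);
}
```

Because the macros expand to whatever conversion the platform needs, the same source line works whether uint64_t is an unsigned long or an unsigned long long.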


Figure 2.14 also shows the representations of constants —1 and 0. Note that —1 
has the same bit representation as UMax—a string of all ones. Numeric value 0 is 
represented as a string of all zeros in both representations. 

The C standards do not require signed integers to be represented in two's- 
complement form, but nearly all machines do so. Programmers who are concerned 
with maximizing portability across all possible machines should not assume any 
particular range of representable values, beyond the ranges indicated in Figure 
2.11, nor should they assume any particular representation of signed numbers. 
On the other hand, many programs are written assuming a two's-complement 
representation of signed numbers, and the "typical" ranges shown in Figures 2.9 
and 2.10, and these programs are portable across a broad range of machines 
and compilers. The file <limits.h> in the C library defines a set of constants 
Aside Alternative representations of signed numbers

There are two other standard representations for signed numbers:

Ones' complement. This is the same as two's complement, except that the most
significant bit has weight $-(2^{w-1} - 1)$ rather than $-2^{w-1}$.

Sign magnitude. The most significant bit is a sign bit that determines whether the
remaining bits should be given negative or positive weight.

Both of these representations have the curious property that there are two different
encodings of the number 0. For both representations, $[00 \cdots 0]$ is interpreted as +0.
The value -0 can be represented in sign-magnitude form as $[10 \cdots 0]$ and in ones'
complement as $[11 \cdots 1]$. Although machines based on ones'-complement representations
were built in the past, almost all modern machines use two's complement. We will see
that sign-magnitude encoding is used with floating-point numbers.

Note the different position of apostrophes: two's complement versus ones' complement.
The term "two's complement" arises from the fact that for nonnegative x we compute a
w-bit representation of -x as $2^w - x$ (a single two). The term "ones' complement"
comes from the property that we can compute -x in this notation as
$[111 \cdots 1] - x$ (multiple ones).

delimiting the ranges of the different integer data types for the particular machine
on which the compiler is running. For example, it defines constants INT_MAX,
INT_MIN, and UINT_MAX describing the ranges of signed and unsigned integers. For a
two's-complement machine in which data type int has w bits, these constants
correspond to the values of $TMax_w$, $TMin_w$, and $UMax_w$.
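
The relationships among these constants can be checked on a given machine. The following is a sketch with hypothetical function names. It uses only the standard <limits.h> constants named above, and the check assumes a two's-complement int in which the unsigned type uses all w bits.

```c
#include <limits.h>
#include <stdio.h>

/* Returns 1 if int has the asymmetric two's-complement range:
   TMin = -TMax - 1 and UMax = 2*TMax + 1. */
int int_range_is_twos_complement(void) {
    return INT_MIN == -INT_MAX - 1
        && UINT_MAX == 2u * (unsigned)INT_MAX + 1u;
}

/* Prints the three constants for the machine the program runs on. */
void print_int_ranges(void) {
    printf("INT_MIN  = %d\n", INT_MIN);
    printf("INT_MAX  = %d\n", INT_MAX);
    printf("UINT_MAX = %u\n", UINT_MAX);
}
```

On a typical machine with 32-bit int, print_int_ranges reports the int and unsigned rows of Figure 2.9.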

The Java standard is quite specific about integer data type ranges and repre- 
sentations. It requires a two's-complement representation with the exact ranges 
shown for the 64-bit case (Figure 2.10). In Java, the single-byte data type is called 
byte instead of char. These detailed requirements are intended to enable Java 
programs to behave identically regardless of the machines or operating systems 
running them. 

To get a better understanding of the two's-complement representation, consider
the following code example:

    short x = 12345;
    short mx = -x;

    show_bytes((byte_pointer) &x, sizeof(short));
    show_bytes((byte_pointer) &mx, sizeof(short));
                12,345            -12,345           53,191
Weight       Bit   Value       Bit   Value       Bit   Value
1             1        1        1        1        1        1
2             0        0        1        2        1        2
4             0        0        1        4        1        4
8             1        8        0        0        0        0
16            1       16        0        0        0        0
32            1       32        0        0        0        0
64            0        0        1       64        1       64
128           0        0        1      128        1      128
256           0        0        1      256        1      256
512           0        0        1      512        1      512
1,024         0        0        1    1,024        1    1,024
2,048         0        0        1    2,048        1    2,048
4,096         1    4,096        0        0        0        0
8,192         1    8,192        0        0        0        0
16,384        0        0        1   16,384        1   16,384
±32,768       0        0        1  -32,768        1   32,768
Total             12,345           -12,345            53,191


Figure 2.15 Two's-complement representations of 12,345 and —12,345, and 
unsigned representation of 53,191. Note that the latter two have identical bit 
representations. 


When run on a big-endian machine, this code prints 30 39 and cf c7, indicating
that x has hexadecimal representation 0x3039, while mx has hexadecimal
representation 0xCFC7. Expanding these into binary, we get bit patterns
[0011000000111001] for x and [1100111111000111] for mx. As Figure 2.15 shows,
Equation 2.3 yields values 12,345 and -12,345 for these two bit patterns.
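
The two hexadecimal patterns can also be obtained without depending on the byte order of the host. The following is a sketch, not the show_bytes routine used above: the helper short_to_hex is hypothetical, and it assumes that short is 16 bits wide.

```c
#include <stdint.h>
#include <stdio.h>

/* Writes the 16-bit pattern of x as four lowercase hex digits, most
   significant byte first, regardless of the host byte order. */
void short_to_hex(short x, char out[5]) {
    /* Conversion to an unsigned type is defined modulo 2^16, which
       gives the two's-complement bit pattern of x. */
    uint16_t bits = (uint16_t)x;
    snprintf(out, 5, "%04x", (unsigned)bits);
}
```

Calling short_to_hex with 12345 and -12345 produces the strings 3039 and cfc7.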





In Chapter 3, we will look at listings generated by a disassembler, a program that 
converts an executable program file back to a more readable ASCII form. These 
files contain many hexadecimal numbers, typically representing values in two's- 
complement form. Being able to recognize these numbers and understand their 
significance (for example, whether they are negative or positive) is an important 
skill. 

For the lines labeled A-I (on the right) in the following listing, convert the 
hexadecimal values (in 32-bit two’s-complement form) shown to the right of the 
instruction names (sub, mov, and add) into their decimal equivalents: 





    4004d0: 48 81 ec e0 02 00 00    sub    $0x2e0,%rsp                A.
    4004d7: 48 8b 44 24 a8          mov    -0x58(%rsp),%rax           B.
    4004dc: 48 03 47 28             add    0x28(%rdi),%rax            C.
    4004e0: 48 89 44 24 d0          mov    %rax,-0x30(%rsp)           D.
    4004e5: 48 8b 44 24 78          mov    0x78(%rsp),%rax            E.
    4004ea: 48 89 87 88 00 00 00    mov    %rax,0x88(%rdi)            F.
    4004f1: 48 8b 84 24 f8 01 00    mov    0x1f8(%rsp),%rax           G.
    4004f8: 00
    4004f9: 48 03 44 24 08          add    0x8(%rsp),%rax
    4004fe: 48 89 84 24 c0 00 00    mov    %rax,0xc0(%rsp)            H.
    400505: 00
    400506: 48 8b 44 d4 b8          mov    -0x48(%rsp,%rdx,8),%rax    I.





2.2.4 Conversions between Signed and Unsigned 


C allows casting between different numeric data types. For example, suppose
variable x is declared as int and u as unsigned. The expression (unsigned) x
converts the value of x to an unsigned value, and (int) u converts the value of u
to a signed integer. What should be the effect of casting a signed value to unsigned,
or vice versa? From a mathematical perspective, one can imagine several different
conventions. Clearly, we want to preserve any value that can be represented in
both forms. On the other hand, converting a negative value to unsigned might yield
zero. Converting an unsigned value that is too large to be represented in
two's-complement form might yield TMax. For most implementations of C, however,
the answer to this question is based on a bit-level perspective, rather than on a
numeric one.
For example, consider the following code:

    short int v = -12345;
    unsigned short uv = (unsigned short) v;
    printf("v = %d, uv = %u\n", v, uv);

When run on a two's-complement machine, it generates the following output:


v = -12345, uv = 53191 


What we see here is that the effect of casting is to keep the bit values identical
but change how these bits are interpreted. We saw in Figure 2.15 that the 16-bit
two's-complement representation of -12,345 is identical to the 16-bit unsigned
representation of 53,191. Casting from short to unsigned short changed the
numeric value, but not the bit representation.

Similarly, consider the following code:


    unsigned u = 4294967295u;  /* UMax */
    int tu = (int) u;
    printf("u = %u, tu = %d\n", u, tu);

When run on a two's-complement machine, it generates the following output:

u = 4294967295, tu = -1


We can see from Figure 2.14 that, for a 32-bit word size, the bit patterns
representing 4,294,967,295 ($UMax_{32}$) in unsigned form and -1 in two's-complement
form are identical. In casting from unsigned to int, the underlying bit representation
stays the same.
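
Both casts can be wrapped in small functions for experimentation. The following is a sketch with hypothetical function names. It assumes a 16-bit short and a two's-complement machine. The comments mark which conversion is fixed by the C standard and which is left to the implementation.

```c
#include <limits.h>

/* short -> unsigned short. The C standard defines this conversion as
   reduction modulo USHRT_MAX + 1, which preserves the bit pattern on a
   two's-complement machine. */
unsigned short to_unsigned_short(short v) {
    return (unsigned short)v;
}

/* unsigned -> int. For values above INT_MAX the result is
   implementation-defined; most compilers keep the bit pattern. */
int to_signed_int(unsigned u) {
    return (int)u;
}
```

On such a machine, to_unsigned_short(-12345) gives 53191 and to_signed_int(UINT_MAX) gives -1, matching the two outputs shown above.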

This is a general rule for how most C implementations handle conversions
between signed and unsigned numbers with the same word size: the numeric
values might change, but the bit patterns do not. Let us capture this idea in
a more mathematical form. We defined functions $U2B_w$ and $T2B_w$ that map
numbers to their bit representations in either unsigned or two's-complement form.
That is, given an integer x in the range $0 \le x \le UMax_w$, the function $U2B_w(x)$
gives the unique w-bit unsigned representation of x. Similarly, when x is in the
range $TMin_w \le x \le TMax_w$, the function $T2B_w(x)$ gives the unique w-bit
two's-complement representation of x.

Now define the function $T2U_w$ as $T2U_w(x) \doteq B2U_w(T2B_w(x))$. This function
takes a number between $TMin_w$ and $TMax_w$ and yields a number between 0 and
$UMax_w$, where the two numbers have identical bit representations, except that
the argument has a two's-complement representation while the result is unsigned.
Similarly, for x between 0 and $UMax_w$, the function $U2T_w$, defined as
$U2T_w(x) \doteq B2T_w(U2B_w(x))$, yields the number having the same two's-complement
representation as the unsigned representation of x.

Pursuing our earlier examples, we see from Figure 2.15 that $T2U_{16}(-12{,}345) = 53{,}191$,
and that $U2T_{16}(53{,}191) = -12{,}345$. That is, the 16-bit pattern written in
hexadecimal as 0xCFC7 is both the two's-complement representation of -12,345
and the unsigned representation of 53,191. Note also that
$12{,}345 + 53{,}191 = 65{,}536 = 2^{16}$. This property generalizes to a relationship
between the two numeric values (two's complement and unsigned) represented by a given
bit pattern. Similarly, from Figure 2.14, we see that $T2U_{32}(-1) = 4{,}294{,}967{,}295$, and
$U2T_{32}(4{,}294{,}967{,}295) = -1$. That is, UMax has the same bit representation in
unsigned form as does -1 in two's-complement form. We can also see the relationship
between these two numbers: $1 + UMax_w = 2^w$.

We see, then, that function T2U describes the conversion of a two's-complement number to its unsigned counterpart, while U2T converts in the opposite direction. These describe the effect of casting between these data types in most C implementations.

Using the table you filled in when solving Problem 2.17, fill in the following table describing the function $T2U_4$:

    $x$        $T2U_4(x)$

The relationship we have seen, via several examples, between the two's-complement and unsigned values for a given bit pattern can be expressed as a property of the function T2U:


PRINCIPLE: Conversion from two's complement to unsigned
For $x$ such that $TMin_w \le x \le TMax_w$:

$$T2U_w(x) = \begin{cases} x + 2^w, & x < 0 \\ x, & x \ge 0 \end{cases} \tag{2.5}$$


For example, we saw that $T2U_{16}(-12{,}345) = -12{,}345 + 2^{16} = 53{,}191$, and also that $T2U_w(-1) = -1 + 2^w = UMax_w$.

This property can be derived by comparing Equations 2.1 and 2.3.
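As a rough illustration of Equation 2.5, the following sketch evaluates the case formula for $w = 16$ with ordinary integer arithmetic and compares it with the value obtained by converting to a 16-bit unsigned type. The function names t2u16_formula and t2u16_cast are hypothetical, chosen only for this example, and the fixed-width types come from stdint.h.

```c
#include <stdint.h>

/* Equation 2.5 for w = 16, evaluated with ordinary (wider) integer arithmetic. */
long t2u16_formula(int16_t x) {
    return x < 0 ? (long)x + 65536L : (long)x;
}

/* The same mapping obtained by converting to a 16-bit unsigned type. */
long t2u16_cast(int16_t x) {
    return (long)(uint16_t)x;
}
```

For the running example, both functions should map -12345 to 53191, and both leave nonnegative arguments unchanged.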

DERIVATION: Conversion from two's complement to unsigned
Comparing Equations 2.1 and 2.3, we can see that for bit pattern $\vec{x}$, if we compute the difference $B2U_w(\vec{x}) - B2T_w(\vec{x})$, the weighted sums for bits from 0 to $w - 2$ will cancel each other, leaving a value $B2U_w(\vec{x}) - B2T_w(\vec{x}) = x_{w-1}(2^{w-1} - (-2^{w-1})) = x_{w-1}2^w$. This gives a relationship $B2U_w(\vec{x}) = B2T_w(\vec{x}) + x_{w-1}2^w$. We therefore have

$$B2U_w(T2B_w(x)) = T2U_w(x) = x + x_{w-1}2^w \tag{2.6}$$

In a two's-complement representation of $x$, bit $x_{w-1}$ determines whether or not $x$ is negative, giving the two cases of Equation 2.5.

As examples, Figure 2.16 compares how functions B2U and B2T assign values to bit patterns for $w = 4$. For the two's-complement case, the most significant bit serves as the sign bit, which we diagram as a leftward-pointing gray bar. For the unsigned case, this bit has positive weight, which we show as a rightward-pointing black bar. In going from two's complement to unsigned, the most significant bit changes its weight from -8 to +8. As a consequence, the values that are negative in a two's-complement representation increase by $2^4 = 16$ with an unsigned representation. Thus, -5 becomes +11, and -1 becomes +15.








Figure 2.16 Comparing unsigned and two's-complement representations for $w = 4$. The weight of the most significant bit is -8 for two's complement and +8 for unsigned, yielding a net difference of 16.

Figure 2.17 Conversion from two's complement to unsigned. Function T2U converts negative numbers to large positive numbers.

Figure 2.17 illustrates the general behavior of function T2U. As it shows, when mapping a signed number to its unsigned counterpart, negative numbers are converted to large positive numbers, while nonnegative numbers remain unchanged.





Practice Problem
Explain how Equation 2.5 applies to the entries in the table you generated when solving Problem 2.19.





Going in the other direction, we can state the relationship between an unsigned number $u$ and its signed counterpart $U2T_w(u)$:

PRINCIPLE: Unsigned to two's-complement conversion
For $u$ such that $0 \le u \le UMax_w$:

$$U2T_w(u) = \begin{cases} u, & u \le TMax_w \\ u - 2^w, & u > TMax_w \end{cases} \tag{2.7}$$


Figure 2.18 Conversion from unsigned to two's complement. Function U2T converts numbers greater than $2^{w-1} - 1$ to negative values.

This principle can be justified as follows:


DERIVATION: Unsigned to two's-complement conversion
Let $\vec{u} = U2B_w(u)$. This bit vector will also be the two's-complement representation of $U2T_w(u)$. Equations 2.1 and 2.3 can be combined to give

$$U2T_w(u) = -u_{w-1}2^w + u \tag{2.8}$$

In the unsigned representation of $u$, bit $u_{w-1}$ determines whether or not $u$ is greater than $TMax_w = 2^{w-1} - 1$, giving the two cases of Equation 2.7.


The behavior of function U2T is illustrated in Figure 2.18. For small ($\le TMax_w$) numbers, the conversion from unsigned to signed preserves the numeric value. Large ($> TMax_w$) numbers are converted to negative values.

To summarize, we considered the effects of converting in both directions between unsigned and two's-complement representations. For values $x$ in the range $0 \le x \le TMax_w$, we have $T2U_w(x) = x$ and $U2T_w(x) = x$. That is, numbers in this range have identical unsigned and two's-complement representations. For values outside of this range, the conversions either add or subtract $2^w$. For example, we have $T2U_w(-1) = -1 + 2^w = UMax_w$: the negative number closest to zero maps to the largest unsigned number. At the other extreme, one can see that $T2U_w(TMin_w) = -2^{w-1} + 2^w = 2^{w-1} = TMax_w + 1$: the most negative number maps to an unsigned number just outside the range of positive two's-complement numbers. Using the example of Figure 2.15, we can see that $T2U_{16}(-12{,}345) = 65{,}536 + -12{,}345 = 53{,}191$.
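The opposite direction can be sketched in the same way. The code below, offered only as an illustration, evaluates Equation 2.7 for $w = 16$ and compares it with copying the 16 bits into a signed object. The names u2t16_formula and u2t16_bits are hypothetical. The bit copy uses memcpy rather than a cast, since the result of an out-of-range unsigned-to-signed cast is implementation-defined in C, whereas int16_t is required to be a two's-complement type.

```c
#include <stdint.h>
#include <string.h>

/* Equation 2.7 for w = 16, evaluated with ordinary (wider) integer arithmetic. */
long u2t16_formula(uint16_t u) {
    return u <= INT16_MAX ? (long)u : (long)u - 65536L;
}

/* The same mapping obtained by copying the 16 bits into a signed object. */
long u2t16_bits(uint16_t u) {
    int16_t t;
    memcpy(&t, &u, sizeof t);   /* bit pattern unchanged, interpretation changes */
    return t;
}
```

With the running example, both functions should map 53191 back to -12345, and values up to 32767 are returned unchanged.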


2.2.5 Signed versus Unsigned in C 


As indicated in Figures 2.9 and 2.10, C supports both signed and unsigned arithmetic for all of its integer data types. Although the C standard does not specify a particular representation of signed numbers, almost all machines use two's complement. Generally, most numbers are signed by default. For example, when declaring a constant such as 12345 or 0x1A2B, the value is considered signed. Adding character 'U' or 'u' as a suffix creates an unsigned constant; for example, 12345U or 0x1A2Bu.

C allows conversion between unsigned and signed. Although the C standard does not specify precisely how this conversion should be made, most systems follow the rule that the underlying bit representation does not change. This rule has the effect of applying the function $U2T_w$ when converting from unsigned to signed, and $T2U_w$ when converting from signed to unsigned, where $w$ is the number of bits for the data type.

Conversions can happen due to explicit casting, such as in the following code:

    int tx, ty;
    unsigned ux, uy;

    tx = (int) ux;
    uy = (unsigned) ty;

Alternatively, they can happen implicitly when an expression of one type is assigned to a variable of another, as in the following code:

    int tx, ty;
    unsigned ux, uy;

    tx = ux; /* Cast to signed */
    uy = ty; /* Cast to unsigned */

When printing numeric values with printf, the directives %d, %u, and %x are used to print a number as a signed decimal, an unsigned decimal, and in hexadecimal format, respectively. Note that printf does not make use of any type information, and so it is possible to print a value of type int with directive %u and a value of type unsigned with directive %d. For example, consider the following code:

    int x = -1;
    unsigned u = 2147483648; /* 2 to the 31st */

    printf("x = %u = %d\n", x, x);
    printf("u = %u = %d\n", u, u);

When compiled as a 32-bit program, it prints the following:

    x = 4294967295 = -1
    u = 2147483648 = -2147483648

In both cases, printf prints the word first as if it represented an unsigned number and second as if it represented a signed number. We can see the conversion routines in action: $T2U_{32}(-1) = UMax_{32} = 2^{32} - 1$ and $U2T_{32}(2^{31}) = 2^{31} - 2^{32} = -2^{31} = TMin_{32}$.

Some possibly nonintuitive behavior arises due to C's handling of expressions containing combinations of signed and unsigned quantities. When an operation is performed where one operand is signed and the other is unsigned, C implicitly casts the signed argument to unsigned and performs the operations assuming the numbers are nonnegative. As we will see, this convention makes little difference for standard arithmetic operations, but it leads to nonintuitive results for relational operators such as < and >. Figure 2.19 shows some sample relational expressions and their resulting evaluations, when data type int has a 32-bit two's-complement representation. Consider the comparison -1 < 0U. Since the second operand is unsigned, the first one is implicitly cast to unsigned, and hence the expression is equivalent to the comparison 4294967295U < 0U (recall that $T2U_w(-1) = UMax_w$), which of course is false. The other cases can be understood by similar analyses.

    Expression                          Type        Evaluation
    0 == 0U                             Unsigned    1
    -1 < 0                              Signed      1
    -1 < 0U                             Unsigned    0*
    2147483647 > -2147483647-1          Signed      1
    2147483647U > -2147483647-1         Unsigned    0*
    2147483647 > (int) 2147483648U      Signed      1*
    -1 > -2                             Signed      1
    (unsigned) -1 > -2                  Unsigned    1

Figure 2.19 Effects of C promotion rules. Nonintuitive cases are marked by '*'. When either operand of a comparison is unsigned, the other operand is implicitly cast to unsigned. See Web Aside DATA:TMIN for why we write $TMin_{32}$ as -2,147,483,647-1.

Assuming the expressions are evaluated when executing a 32-bit program on a machine that uses two's-complement arithmetic, fill in the following table describing the effect of casting and relational operations, in the style of Figure 2.19:

    Expression                          Type        Evaluation
    -2147483647-1 == 2147483648U        ________    ________
    -2147483647-1 < 2147483647          ________    ________
    -2147483647-1U < 2147483647         ________    ________
    -2147483647-1 < -2147483647         ________    ________
    -2147483647-1U < -2147483647        ________    ________
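The comparison -1 < 0U discussed with Figure 2.19 can be reproduced with a small sketch. The two helper functions below are hypothetical names introduced only for illustration. They differ solely in the declared type of the second parameter, which is enough to switch the comparison from signed to unsigned. Many compilers can warn about the mixed comparison when sign-comparison warnings are enabled.

```c
/* Both operands are int, so the comparison is signed. */
int less_signed(int a, int b) {
    return a < b;
}

/* One operand is unsigned, so a is converted to unsigned before comparing. */
int less_mixed(int a, unsigned b) {
    return a < b;
}
```

Calling less_signed(-1, 0) yields 1, while less_mixed(-1, 0U) yields 0, because -1 is converted to the largest unsigned value before the comparison is made.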


2.2.6 Expanding the Bit Representation of a Number

One common operation is to convert between integers having different word sizes while retaining the same numeric value. Of course, this may not be possible when the destination data type is too small to represent the desired value. Converting from a smaller to a larger data type, however, should always be possible.

Web Aside DATA:TMIN  Writing TMin in C


In Figure 2.19 and in Problem 2.21, we carefully wrote the value of $TMin_{32}$ as -2,147,483,647-1. Why not simply write it as either -2,147,483,648 or 0x80000000? Looking at the C header file limits.h, we see that they use a similar method as we have to write $TMin_{32}$ and $TMax_{32}$:

    /* Minimum and maximum values a 'signed int' can hold. */
    #define INT_MAX 2147483647
    #define INT_MIN (-INT_MAX - 1)

Unfortunately, a curious interaction between the asymmetry of the two's-complement representation and the conversion rules of C forces us to write $TMin_{32}$ in this unusual way. Although understanding this issue requires us to delve into one of the murkier corners of the C language standards, it will help

To convert an unsigned number to a larger data type, we can simply add leading zeros to the representation; this operation is known as zero extension, expressed by the following principle:


PRINCIPLE: Expansion of an unsigned number by zero extension

Define bit vectors $\vec{u} = [u_{w-1}, u_{w-2}, \ldots, u_0]$ of width $w$ and $\vec{u}' = [0, \ldots, 0, u_{w-1}, u_{w-2}, \ldots, u_0]$ of width $w'$, where $w' > w$. Then $B2U_w(\vec{u}) = B2U_{w'}(\vec{u}')$.


This principle can be seen to follow directly from the definition of the unsigned 
encoding, given by Equation 2.1. 

For converting a two's-complement number to a larger data type, the rule is to perform a sign extension, adding copies of the most significant bit to the representation, expressed by the following principle. We show the sign bit $x_{w-1}$ in blue to highlight its role in sign extension.


PRINCIPLE: Expansion of a two's-complement number by sign extension

Define bit vectors $\vec{x} = [x_{w-1}, x_{w-2}, \ldots, x_0]$ of width $w$ and $\vec{x}' = [x_{w-1}, \ldots, x_{w-1}, x_{w-1}, x_{w-2}, \ldots, x_0]$ of width $w'$, where $w' > w$. Then $B2T_w(\vec{x}) = B2T_{w'}(\vec{x}')$.


As an example, consider the following code:

    short sx = -12345;       /* -12345 */
    unsigned short usx = sx; /* 53191 */
    int x = sx;              /* -12345 */
    unsigned ux = usx;       /* 53191 */

    printf("sx = %d:\t", sx);
    show_bytes((byte_pointer) &sx, sizeof(short));
    printf("usx = %u:\t", usx);
    show_bytes((byte_pointer) &usx, sizeof(unsigned short));
    printf("x = %d:\t", x);
    show_bytes((byte_pointer) &x, sizeof(int));
    printf("ux = %u:\t", ux);
    show_bytes((byte_pointer) &ux, sizeof(unsigned));

When run as a 32-bit program on a big-endian machine that uses a two's-complement representation, this code prints the output

    sx = -12345:    cf c7
    usx = 53191:    cf c7
    x = -12345:     ff ff cf c7
    ux = 53191:     00 00 cf c7


We see that, although the two's-complement representation of -12,345 and the unsigned representation of 53,191 are identical for a 16-bit word size, they differ for a 32-bit word size. In particular, -12,345 has hexadecimal representation 0xFFFFCFC7, while 53,191 has hexadecimal representation 0x0000CFC7. The former has been sign extended: 16 copies of the most significant bit 1, having hexadecimal representation 0xFFFF, have been added as leading bits. The latter has been extended with 16 leading zeros, having hexadecimal representation 0x0000.

As an illustration, Figure 2.20 shows the result of expanding from word size $w = 3$ to $w = 4$ by sign extension. Bit vector [101] represents the value $-4 + 1 = -3$. Applying sign extension gives bit vector [1101] representing the value $-8 + 4 + 1 = -3$. We can see that, for $w = 4$, the combined value of the two most significant bits, $-8 + 4 = -4$, matches the value of the sign bit for $w = 3$. Similarly, bit vectors [111] and [1111] both represent the value -1.
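The two kinds of extension can also be written out at the bit level. The sketch below is an illustration with hypothetical function names, not code from the text. It widens a 16-bit pattern to 32 bits, once by zero extension and once by replicating bit 15 into the upper half by hand. In ordinary C code the compiler performs the same work when a short or unsigned short is converted to a wider type.

```c
#include <stdint.h>

/* Zero extension: the upper 16 bits of the result are all zeros. */
uint32_t zero_extend16(uint16_t bits) {
    return (uint32_t)bits;
}

/* Sign extension done by hand: copy bit 15 into bits 16 through 31. */
uint32_t sign_extend16(uint16_t bits) {
    uint32_t wide = bits;
    if (bits & 0x8000u)          /* most significant bit of the 16-bit pattern */
        wide |= 0xFFFF0000u;
    return wide;
}
```

Applied to the pattern 0xCFC7, zero extension gives 0x0000CFC7 and sign extension gives 0xFFFFCFC7, matching the byte sequences printed in the example above.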
With this as intuition, we can now show that sign extension preserves the value 
of a two's-complement number. 


Figure 2.20 Examples of sign extension from $w = 3$ to $w = 4$. For $w = 4$, the combined weight of the upper 2 bits is $-8 + 4 = -4$, matching that of the sign bit for $w = 3$.








DERIVATION: Expansion of a two's-complement number by sign extension
Let $w' = w + k$. What we want to prove is that

$$B2T_{w+k}([\underbrace{x_{w-1}, \ldots, x_{w-1}}_{k \text{ times}}, x_{w-1}, x_{w-2}, \ldots, x_0]) = B2T_w([x_{w-1}, x_{w-2}, \ldots, x_0])$$

The proof follows by induction on $k$. That is, if we can prove that sign extending by 1 bit preserves the numeric value, then this property will hold when sign extending by an arbitrary number of bits. Thus, the task reduces to proving that

$$B2T_{w+1}([x_{w-1}, x_{w-1}, x_{w-2}, \ldots, x_0]) = B2T_w([x_{w-1}, x_{w-2}, \ldots, x_0])$$

Expanding the left-hand expression with Equation 2.3 gives the following:

$$\begin{aligned}
B2T_{w+1}([x_{w-1}, x_{w-1}, x_{w-2}, \ldots, x_0]) &= -x_{w-1}2^w + \sum_{i=0}^{w-1} x_i 2^i \\
&= -x_{w-1}2^w + x_{w-1}2^{w-1} + \sum_{i=0}^{w-2} x_i 2^i \\
&= -x_{w-1}\left(2^w - 2^{w-1}\right) + \sum_{i=0}^{w-2} x_i 2^i \\
&= -x_{w-1}2^{w-1} + \sum_{i=0}^{w-2} x_i 2^i \\
&= B2T_w([x_{w-1}, x_{w-2}, \ldots, x_0])
\end{aligned}$$

The key property we exploit is that $2^w - 2^{w-1} = 2^{w-1}$. Thus, the combined effect of adding a bit of weight $-2^w$ and of converting the bit having weight $-2^{w-1}$ to be one with weight $2^{w-1}$ is to preserve the original numeric value.








Show that each of the following bit vectors is a two's-complement representation of -5 by applying Equation 2.3:

A. [1011] 

B. [11011] 

C. [111011] 


Observe that the second and third bit vectors can be derived from the first by sign extension.

One point worth making is that the relative order of conversion from one data size to another and between unsigned and signed can affect the behavior of a program. Consider the following code:


    short sx = -12345;  /* -12345 */
    unsigned uy = sx;   /* Mystery! */

    printf("uy = %u:\t", uy);
    show_bytes((byte_pointer) &uy, sizeof(unsigned));

When run on a big-endian machine, this code causes the following output to be 
printed: 


uy = 4294954951: ff ff cf c7 


This shows that, when converting from short to unsigned, the program first changes the size and then the type. That is, (unsigned) sx is equivalent to (unsigned) (int) sx, evaluating to 4,294,954,951, not (unsigned) (unsigned short) sx, which evaluates to 53,191. Indeed, this convention is required by the C standards.
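The two conversion orders can be compared directly. In the sketch below, the function names are hypothetical and the expected values assume a 16-bit short and a 32-bit int, as in the example above.

```c
/* Size first, then type: equivalent to what C does for (unsigned) sx. */
unsigned widen_then_retype(short sx) {
    return (unsigned)(int)sx;
}

/* Type first, then size: the alternative order the text contrasts it with. */
unsigned retype_then_widen(short sx) {
    return (unsigned)(unsigned short)sx;
}
```

Under those assumptions, widen_then_retype(-12345) gives 4294954951, the same value as a plain assignment of sx to an unsigned variable, while retype_then_widen(-12345) gives 53191.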








Consider the following C functions:

    int fun1(unsigned word) {
        return (int) ((word << 24) >> 24);
    }

    int fun2(unsigned word) {
        return ((int) word << 24) >> 24;
    }

Assume these are executed as a 32-bit program on a machine that uses two's-complement arithmetic. Assume also that right shifts of signed values are performed arithmetically, while right shifts of unsigned values are performed logically.






A. Fill in the following table showing the effect of these functions for several example arguments. You will find it more convenient to work with a hexadecimal representation. Just remember that hex digits 8 through F have their most significant bits equal to 1.

    w             fun1(w)       fun2(w)
    0x00000076    ________      ________
    0x87654321    ________      ________
    0x000000C9    ________      ________
    0xEDCBA987    ________      ________









2.2.7 Truncating Numbers

Suppose that, rather than extending a value with extra bits, we reduce the number of bits representing a number. This occurs, for example, in the following code:

    int x = 53191;
    short sx = (short) x; /* -12345 */
    int y = sx;           /* -12345 */

Casting x to be short will truncate a 32-bit int to a 16-bit short. As we saw before, this 16-bit pattern is the two's-complement representation of -12,345. When casting this back to int, sign extension will set the high-order 16 bits to ones, yielding the 32-bit two's-complement representation of -12,345.

When truncating a $w$-bit number $\vec{x} = [x_{w-1}, x_{w-2}, \ldots, x_0]$ to a $k$-bit number, we drop the high-order $w - k$ bits, giving a bit vector $\vec{x}' = [x_{k-1}, x_{k-2}, \ldots, x_0]$. Truncating a number can alter its value, which is a form of overflow. For an unsigned number, we can readily characterize the numeric value that will result.


PRINCIPLE: Truncation of an unsigned number

Let $\vec{x}$ be the bit vector $[x_{w-1}, x_{w-2}, \ldots, x_0]$, and let $\vec{x}'$ be the result of truncating it to $k$ bits: $\vec{x}' = [x_{k-1}, x_{k-2}, \ldots, x_0]$. Let $x = B2U_w(\vec{x})$ and $x' = B2U_k(\vec{x}')$. Then $x' = x \bmod 2^k$.


The intuition behind this principle is simply that all of the bits that were truncated have weights of the form $2^i$, where $i \ge k$, and therefore each of these weights reduces to zero under the modulus operation. This is formalized by the following derivation:


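DERIVATION: Truncation of an unsigned number
Applying the modulus operation to Equation 2.1 yields

$$\begin{aligned}
B2U_w([x_{w-1}, x_{w-2}, \ldots, x_0]) \bmod 2^k &= \left[\sum_{i=0}^{w-1} x_i 2^i\right] \bmod 2^k \\
&= \left[\sum_{i=0}^{k-1} x_i 2^i\right] \bmod 2^k \\
&= \sum_{i=0}^{k-1} x_i 2^i \\
&= B2U_k([x_{k-1}, x_{k-2}, \ldots, x_0])
\end{aligned}$$

In this derivation, we make use of the property that $2^i \bmod 2^k = 0$ for any $i \ge k$.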


A similar property holds for truncating a two's-complement number, except that it then converts the most significant bit into a sign bit:








PRINCIPLE: Truncation of a two's-complement number

Let $\vec{x}$ be the bit vector $[x_{w-1}, x_{w-2}, \ldots, x_0]$, and let $\vec{x}'$ be the result of truncating it to $k$ bits: $\vec{x}' = [x_{k-1}, x_{k-2}, \ldots, x_0]$. Let $x = B2T_w(\vec{x})$ and $x' = B2T_k(\vec{x}')$. Then $x' = U2T_k(x \bmod 2^k)$.

In this formulation, $x \bmod 2^k$ will be a number between 0 and $2^k - 1$. Applying function $U2T_k$ to it will have the effect of converting the most significant bit $x_{k-1}$ from having weight $2^{k-1}$ to having weight $-2^{k-1}$. We can see this with the example of converting value $x = 53{,}191$ from int to short. Since $2^{16} = 65{,}536 > x$, we have $x \bmod 2^{16} = x$. But when we convert this number to a 16-bit two's-complement number, we get $x' = 53{,}191 - 65{,}536 = -12{,}345$.


DERIVATION: Truncation of a two's-complement number

Using a similar argument to the one we used for truncation of an unsigned number shows that

$$B2T_w([x_{w-1}, x_{w-2}, \ldots, x_0]) \bmod 2^k = B2U_k([x_{k-1}, x_{k-2}, \ldots, x_0])$$

That is, $x \bmod 2^k$ can be represented by an unsigned number having bit-level representation $[x_{k-1}, x_{k-2}, \ldots, x_0]$. Converting this to a two's-complement number gives $x' = U2T_k(x \bmod 2^k)$.

Summarizing, the effect of truncation for unsigned numbers is

$$B2U_k([x_{k-1}, x_{k-2}, \ldots, x_0]) = B2U_w([x_{w-1}, x_{w-2}, \ldots, x_0]) \bmod 2^k \tag{2.9}$$

while the effect for two's-complement numbers is

$$B2T_k([x_{k-1}, x_{k-2}, \ldots, x_0]) = U2T_k(B2U_w([x_{w-1}, x_{w-2}, \ldots, x_0]) \bmod 2^k) \tag{2.10}$$
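Equations 2.9 and 2.10 translate fairly directly into code. The following is a minimal sketch, assuming a 32-bit source word and a target width $k$ between 1 and 31. The function names trunc_unsigned and trunc_twos are hypothetical. The modulus is computed with a bit mask, and the $U2T_k$ step is carried out arithmetically by subtracting $2^k$ when bit $k - 1$ is set.

```c
#include <stdint.h>

/* Equation 2.9: keep the low k bits, that is, x mod 2^k (1 <= k <= 31). */
uint32_t trunc_unsigned(uint32_t x, int k) {
    return x & ((UINT32_C(1) << k) - 1);
}

/* Equation 2.10: reduce mod 2^k, then apply U2T_k to the result. */
int32_t trunc_twos(int32_t x, int k) {
    uint32_t low = trunc_unsigned((uint32_t)x, k);   /* x mod 2^k */
    uint32_t top = UINT32_C(1) << (k - 1);           /* bit k-1, the new sign bit */
    if (low & top)
        return (int32_t)((int64_t)low - ((int64_t)1 << k));
    return (int32_t)low;
}
```

For the example in the text, trunc_twos(53191, 16) should evaluate to -12345, while trunc_unsigned(53191, 16) leaves the value at 53191.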





Suppose we truncate a 4-bit value (represented by hex digits 0 through F) to a 3-bit value (represented as hex digits 0 through 7.) Fill in the table below showing the effect of this truncation for some cases, in terms of the unsigned and two's-complement interpretations of those bit patterns.

    Hex                      Unsigned                 Two's complement
    Original   Truncated     Original   Truncated     Original   Truncated
    0          0             0          ________      0          ________
    2          2             2          ________      2          ________
    9          1             9          ________      -7         ________
    B          3             11         ________      -5         ________
    F          7             15         ________      -1         ________

Explain how Equations 2.9 and 2.10 apply to these cases.


2.2.8 Advice on Signed versus Unsigned


As we have seen, the implicit casting of signed to unsigned leads to some non- 
intuitive behavior. Nonintuitive features often lead to program bugs, and ones 
involving the nuances of implicit casting can be especially difficult to see. Since the 
casting takes place without any clear indication in the code, programmers often 
overlook its effects. 

The following two practice problems illustrate some of the subtle errors that 
can arise due to implicit casting and the unsigned data type. 






Consider the following code that attempts to sum the elements of an array a, where the number of elements is given by parameter length:

    /* WARNING: This is buggy code */
    float sum_elements(float a[], unsigned length) {
        int i;
        float result = 0;

        for (i = 0; i <= length-1; i++)
            result += a[i];
        return result;
    }


When run with argument length equal to 0, this code should return 0.0. 
Instead, it encounters a memory error. Explain why this happens. Show how this 
code can be corrected. 





string is longer than another. You decide to make use of the string library function 
strlen having the following declaration: 


/* Prototype for library function strlen */ 
size_t strlen(const char *s); 


Here is your first attempt at the function: 


    /* Determine whether string s is longer than string t */
    /* WARNING: This function is buggy */
    int strlonger(char *s, char *t) {
        return strlen(s) - strlen(t) > 0;
    }


When you test this on some sample data, things do not seem to work quite right. You investigate further and determine that, when compiled as a 32-bit program, data type size_t is defined (via typedef) in header file stdio.h to be unsigned.

A. For what cases will this function produce an incorrect result?

B. Explain how this incorrect result comes about.

C. Show how to fix the code so that it will work reliably.





We have seen multiple ways in which the subtle features of unsigned arithmetic, and especially the implicit conversion of signed to unsigned, can lead to errors or vulnerabilities. One way to avoid such bugs is to never use unsigned numbers. In fact, few languages other than C support unsigned integers. Apparently, these other language designers viewed them as more trouble than they are worth. For example, Java supports only signed integers, and it requires that they be implemented with two's-complement arithmetic. The normal right shift operator >> is guaranteed to perform an arithmetic shift. The special operator >>> is defined to perform a logical right shift.

Unsigned values are very useful when we want to think of words as just collections of bits with no numeric interpretation. This occurs, for example, when packing a word with flags describing various Boolean conditions. Addresses are naturally unsigned, so systems programmers find unsigned types to be helpful. Unsigned values are also useful when implementing mathematical packages for modular arithmetic and for multiprecision arithmetic, in which numbers are represented by arrays of words.
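As one illustration of treating a word as a collection of bits, the sketch below packs Boolean flags into an unsigned value. The flag names and helper functions are hypothetical, invented only for this example.

```c
/* Hypothetical flags, one bit each. */
enum { FLAG_READ = 1u << 0, FLAG_WRITE = 1u << 1, FLAG_EXEC = 1u << 2 };

/* Return word with the given flag bit set. */
unsigned set_flag(unsigned word, unsigned flag) {
    return word | flag;
}

/* Return word with the given flag bit cleared. */
unsigned clear_flag(unsigned word, unsigned flag) {
    return word & ~flag;
}

/* Return 1 if the given flag bit is set in word, 0 otherwise. */
int has_flag(unsigned word, unsigned flag) {
    return (word & flag) != 0;
}
```

Because the word is unsigned, the bitwise operations have no sign bit to worry about, and the value is never interpreted as a number.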


2.3 Integer Arithmetic 


Many beginning programmers are surprised to find that adding two positive numbers can yield a negative result, and that the comparison x < y can yield a different result than the comparison x-y < 0. These properties are artifacts of the finite nature of computer arithmetic. Understanding the nuances of computer arithmetic can help programmers write more reliable code.


2.3.1 Unsigned Addition

Consider two nonnegative integers $x$ and $y$, such that $0 \le x, y < 2^w$. Each of these values can be represented by a $w$-bit unsigned number. If we compute their sum, however, we have a possible range $0 \le x + y \le 2^{w+1} - 2$. Representing this sum could require $w + 1$ bits. For example, Figure 2.21 shows a plot of the function $x + y$ when $x$ and $y$ have 4-bit representations. The arguments (shown on the horizontal axes) range from 0 to 15, but the sum ranges from 0 to 30. The shape of the function is a sloping plane (the function is linear in both dimensions). If we were to maintain the sum as a $(w + 1)$-bit number and add it to another value, we may require $w + 2$ bits, and so on. This continued "word size inflation" means we cannot place any bound on the word size required to fully represent the results of arithmetic operations. Some programming languages, such as Lisp, actually support arbitrary size arithmetic to allow integers of any size (within the memory limits of the computer, of course.) More commonly, programming languages support fixed-size arithmetic, and hence operations such as "addition" and "multiplication" differ from their counterpart operations over integers.

Figure 2.21 Integer addition. With a 4-bit word size, the sum could require 5 bits.

Let us define the operation $+^u_w$ for arguments $x$ and $y$, where $0 \le x, y < 2^w$, as the result of truncating the integer sum $x + y$ to be $w$ bits long and then viewing the result as an unsigned number. This can be characterized as a form of modular arithmetic, computing the sum modulo $2^w$ by simply discarding any bits with weight greater than $2^{w-1}$ in the bit-level representation of $x + y$. For example, consider a 4-bit number representation with $x = 9$ and $y = 12$, having bit representations [1001] and [1100], respectively. Their sum is 21, having a 5-bit representation [10101]. But if we discard the high-order bit, we get [0101], that is, decimal value 5. This matches the value $21 \bmod 16 = 5$.
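The truncating addition just described can be mimicked for small word sizes with a mask. The sketch below is an illustration only: the function name uadd_w is hypothetical, and it assumes $1 \le w \le 15$ and arguments below $2^w$, so that the ordinary sum fits comfortably in an unsigned.

```c
/* w-bit unsigned addition: add as ordinary integers, then keep the low w bits. */
unsigned uadd_w(unsigned x, unsigned y, int w) {
    unsigned mask = (1u << w) - 1;   /* 2^w - 1, i.e. w low-order ones */
    return (x + y) & mask;           /* discard bits of weight 2^w and above */
}
```

With $w = 4$, uadd_w(9, 12, 4) evaluates to 5, matching the worked example above.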











86 Chapter 2 Representing and Manipulating Information 


Aside  Security vulnerability in getpeername

In 2002, programmers involved in the FreeBSD open-source operating systems project realized that their implementation of the getpeername library function had a security vulnerability. A simplified version of their code went something like this:


     1  /*
     2   * Illustration of code vulnerability similar to that found in
     3   * FreeBSD's implementation of getpeername()
     4   */
     5
     6  /* Declaration of library function memcpy */
     7  void *memcpy(void *dest, void *src, size_t n);
     8
     9  /* Kernel memory region holding user-accessible data */
    10  #define KSIZE 1024
    11  char kbuf[KSIZE];
    12
    13  /* Copy at most maxlen bytes from kernel region to user buffer */
    14  int copy_from_kernel(void *user_dest, int maxlen) {
    15      /* Byte count len is minimum of buffer size and maxlen */
    16      int len = KSIZE < maxlen ? KSIZE : maxlen;
    17      memcpy(user_dest, kbuf, len);
    18      return len;
    19  }


In this code, we show the prototype for library function memcpy on line 7, which is designed to copy a specified number of bytes n from one region of memory to another.

The function copy_from_kernel, starting at line 14, is designed to copy some of the data maintained by the operating system kernel to a designated region of memory accessible to the user. Most of the data structures maintained by the kernel should not be readable by a user, since they may contain sensitive information about other users and about other jobs running on the system, but the region shown as kbuf was intended to be one that the user could read. The parameter maxlen is intended to be the length of the buffer allocated by the user and indicated by argument user_dest. The computation at line 16 then makes sure that no more bytes are copied than are available in either the source or the destination buffer.

Suppose, however, that some malicious programmer writes code that calls copy_from_kernel with a negative value of maxlen. Then the minimum computation on line 16 will compute this value for len, which will then be passed as the parameter n to memcpy. Note, however, that parameter n is declared as having data type size_t. This data type is declared (via typedef) in the library file stdio.h. Typically, it is defined to be unsigned for 32-bit programs and unsigned long for 64-bit programs. Since argument n is unsigned, memcpy will treat it as a very large positive number and attempt to copy that many bytes from the kernel region to the user's buffer. Copying that many bytes (at least $2^{31}$) will not actually work, because the program will encounter invalid addresses in the process, but the program could read regions of the kernel memory for which it is not authorized.


We can see that this problem arises due to the mismatch between data types: in one place the length parameter is signed; in another place it is unsigned. Such mismatches can be a source of bugs and, as this example shows, can even lead to security vulnerabilities. Fortunately, there were no reported cases where a programmer had exploited the vulnerability in FreeBSD. They issued a security advisory "FreeBSD-SA-02:38.signed-error" advising system administrators on how to apply a patch that would remove the vulnerability. The bug can be fixed by declaring parameter maxlen to copy_from_kernel to be of type size_t, to be consistent with parameter n of memcpy. We should also declare local variable len and the return value to be of type size_t.
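The fix described in the aside can be sketched as follows. This is a simplified illustration that follows the aside's own names (KSIZE, kbuf, copy_from_kernel), not the actual FreeBSD patch. It uses the standard declaration of memcpy from string.h rather than the hand-written prototype.

```c
#include <stddef.h>
#include <string.h>

#define KSIZE 1024
static char kbuf[KSIZE];   /* stand-in for the user-accessible kernel region */

/* Copy at most maxlen bytes from kbuf to user_dest; every length is a size_t. */
size_t copy_from_kernel(void *user_dest, size_t maxlen) {
    /* Byte count len is the minimum of the buffer size and maxlen */
    size_t len = KSIZE < maxlen ? KSIZE : maxlen;
    memcpy(user_dest, kbuf, len);
    return len;
}
```

With this declaration, a negative argument supplied by a caller would be converted to a large unsigned value before the call, so the minimum computation would select KSIZE and memcpy would never be asked to read past the end of kbuf.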


We can characterize operation $+^u_w$ as follows:

PRINCIPLE: Unsigned addition
For $x$ and $y$ such that $0 \le x, y < 2^w$:

$$x +^u_w y = \begin{cases} x + y, & x + y < 2^w & \text{Normal} \\ x + y - 2^w, & 2^w \le x + y < 2^{w+1} & \text{Overflow} \end{cases} \tag{2.11}$$

The two cases of Equation 2.11 are illustrated in Figure 2.22, showing the sum $x + y$ on the left mapping to the unsigned $w$-bit sum $x +^u_w y$ on the right. The normal case preserves the value of $x + y$, while the overflow case has the effect of decrementing this sum by $2^w$.


DERIVATION: Unsigned addition


In general, we can see that if $x + y < 2^w$, the leading bit in the (w + 1)-bit representation
of the sum will equal 0, and hence discarding it will not change the numeric
value. On the other hand, if $2^w \le x + y < 2^{w+1}$, the leading bit in the (w + 1)-bit
representation of the sum will equal 1, and hence discarding it is equivalent to
subtracting $2^w$ from the sum.
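As a rough illustration of Equation 2.11, the following C sketch computes a w-bit unsigned sum for a word size smaller than the machine's. The helper name uadd_w and its parameters are our own assumptions for illustration, not part of any library; the sketch assumes 1 <= w <= 31 and operands already in the range 0 to $2^w - 1$.

```c
#include <stdint.h>

/* Hypothetical helper: w-bit unsigned sum of x and y (Equation 2.11).
 * Assumes 1 <= w <= 31 and 0 <= x, y < 2^w. */
uint32_t uadd_w(uint32_t x, uint32_t y, unsigned w) {
    uint32_t modulus = (uint32_t)1 << w;   /* 2^w */
    uint32_t sum = x + y;                  /* exact, since w <= 31 */
    /* Normal case keeps the sum; overflow case subtracts 2^w. */
    return (sum < modulus) ? sum : sum - modulus;
}
```

With w = 4, for instance, uadd_w(9, 12, 4) takes the overflow branch: the integer sum 21 is reduced by 16 to give 5.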


An arithmetic operation is said to overflow when the full integer result cannot
fit within the word size limits of the data type. As Equation 2.11 indicates, overflow
occurs when the two operands sum to $2^w$ or more. Figure 2.23 shows a plot of the
unsigned addition function for word size w = 4. The sum is computed modulo
$2^4 = 16$. When $x + y < 16$, there is no overflow, and $x +^{u}_{4} y$ is simply $x + y$. This is
shown as the region forming a sloping plane labeled "Normal." When $x + y \ge 16$,
the addition overflows, having the effect of decrementing the sum by 16. This is
shown as the region forming a sloping plane labeled "Overflow."

Figure 2.22 Relation between integer addition and unsigned addition. When $x + y$
is greater than $2^w - 1$, the sum overflows.

Figure 2.23 Unsigned addition. With a 4-bit word size, addition is performed
modulo 16.

When executing C programs, overflows are not signaled as errors. At times, 
however, we might wish to determine whether or not overflow has occurred. 


PRINCIPLE: Detecting overflow of unsigned addition


For x and y in the range $0 \le x, y \le \mathit{UMax}_w$, let $s \doteq x +^{u}_{w} y$. Then the computation
of s overflowed if and only if $s < x$ (or equivalently, $s < y$).


As an illustration, in our earlier example, we saw that $9 +^{u}_{4} 12 = 5$. We can see
that overflow occurred, since $5 < 9$.

DERIVATION: Detecting overflow of unsigned addition


Observe that $x + y \ge x$, and hence if s did not overflow, we will surely have $s \ge x$.
On the other hand, if s did overflow, we have $s = x + y - 2^w$. Given that $y < 2^w$,
we have $y - 2^w < 0$, and hence $s = x + (y - 2^w) < x$.
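The detection principle can also be checked by brute force for a small word size. The sketch below, whose function name is a hypothetical choice of ours, tries all 256 pairs of 4-bit operands and compares the test s < x against the exact condition x + y >= 16.

```c
/* Returns 1 if, for every pair of 4-bit operands, the test s < x
 * (and equally s < y) agrees with the exact overflow condition. */
int uadd_detect_holds_4bit(void) {
    for (unsigned x = 0; x < 16; x++) {
        for (unsigned y = 0; y < 16; y++) {
            unsigned s = (x + y) & 0xF;       /* 4-bit unsigned sum */
            int overflowed = (x + y) >= 16;   /* exact integer test */
            int detected = s < x;             /* test from the principle */
            if (overflowed != detected || (s < x) != (s < y))
                return 0;
        }
    }
    return 1;
}
```

An exhaustive check like this is only practical for tiny word sizes, but it gives a quick sanity check on the derivation.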





Write a function with the following prototype: 





/* Determine whether arguments can be added without overflow */ 
int uadd_ok(unsigned x, unsigned y); 


This function should return 1 if arguments x and y can be added without
causing overflow.

Modular addition forms a mathematical structure known as an abelian group,
named after the Norwegian mathematician Niels Henrik Abel (1802-1829). That
is, it is commutative (that's where the "abelian" part comes in) and associative;
it has an identity element 0, and every element has an additive inverse. Let us
consider the set of w-bit unsigned numbers with addition operation $+^{u}_{w}$. For every
value x, there must be some value $-^{u}_{w} x$ such that $-^{u}_{w} x +^{u}_{w} x = 0$. This additive
inverse operation can be characterized as follows:


PRINCIPLE: Unsigned negation


For any number x such that $0 \le x < 2^w$, its w-bit unsigned negation $-^{u}_{w} x$ is given
by the following:

$$
-^{u}_{w} x =
\begin{cases}
x, & x = 0 \\
2^w - x, & x > 0
\end{cases}
\tag{2.12}
$$


This result can readily be derived by case analysis: 


DERIVATION: Unsigned negation


When $x = 0$, the additive inverse is clearly 0. For $x > 0$, consider the value $2^w - x$.
Observe that this number is in the range $0 < 2^w - x < 2^w$. We can also see that
$(x + 2^w - x) \bmod 2^w = 2^w \bmod 2^w = 0$. Hence it is the inverse of x under $+^{u}_{w}$.
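Equation 2.12 translates directly into C for word sizes smaller than the machine's. The following is a minimal sketch under our own assumptions: the name uneg_w is hypothetical, w is taken to satisfy 1 <= w <= 31, and x is assumed to lie in the range 0 to $2^w - 1$.

```c
#include <stdint.h>

/* Hypothetical helper: w-bit unsigned additive inverse (Equation 2.12).
 * Assumes 1 <= w <= 31 and 0 <= x < 2^w. */
uint32_t uneg_w(uint32_t x, unsigned w) {
    uint32_t modulus = (uint32_t)1 << w;   /* 2^w */
    return (x == 0) ? 0 : modulus - x;     /* 0 is its own inverse */
}
```

Adding x to uneg_w(x, w) and keeping the low w bits always gives 0, which is the defining property of the inverse.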
















We can represent a bit pattern of length w = 4 with a single hex digit. For an
unsigned interpretation of these digits, use Equation 2.12 to fill in the following
table giving the values and the bit representations (in hex) of the unsigned additive
inverses of the digits shown.

        x                  $-^{u}_{4} x$
Hex     Decimal       Decimal     Hex
0       ______        ______      ______
5       ______        ______      ______
8       ______        ______      ______
D       ______        ______      ______
F       ______        ______      ______


2.3.2 Two's-Complement Addition 


With two's-complement addition, we must decide what to do when the result is
either too large (positive) or too small (negative) to represent. Given integer
values x and y in the range $-2^{w-1} \le x, y \le 2^{w-1} - 1$, their sum is in the range
$-2^w \le x + y \le 2^w - 2$, potentially requiring w + 1 bits to represent exactly. As
before, we avoid ever-expanding data sizes by truncating the representation to w
bits. The result is not as familiar mathematically as modular addition, however.
Let us define $x +^{t}_{w} y$ to be the result of truncating the integer sum $x + y$ to be w
bits long and then viewing the result as a two's-complement number.


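PRINCIPLE: Two's-complement addition
For integer values x and y in the range $-2^{w-1} \le x, y \le 2^{w-1} - 1$:

$$
x +^{t}_{w} y =
\begin{cases}
x + y - 2^w, & 2^{w-1} \le x + y \quad \text{(Positive overflow)} \\
x + y, & -2^{w-1} \le x + y < 2^{w-1} \quad \text{(Normal)} \\
x + y + 2^w, & x + y < -2^{w-1} \quad \text{(Negative overflow)}
\end{cases}
\tag{2.13}
$$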


This principle is illustrated in Figure 2.24, where the sum $x + y$ is shown on the
left, having a value in the range $-2^w \le x + y \le 2^w - 2$, and the result of truncating
the sum to a w-bit two's-complement number is shown on the right. (The labels
"Case 1" to "Case 4" in this figure are for the case analysis of the formal derivation
of the principle.) When the sum $x + y$ exceeds $\mathit{TMax}_w$ (case 4), we say that positive
overflow has occurred. In this case, the effect of truncation is to subtract $2^w$ from
the sum. When the sum $x + y$ is less than $\mathit{TMin}_w$ (case 1), we say that negative
overflow has occurred. In this case, the effect of truncation is to add $2^w$ to the sum.

The w-bit two's-complement sum of two numbers has the exact same bit-level
representation as the unsigned sum. In fact, most computers use the same machine
instruction to perform either unsigned or signed addition.


DERIVATION: Two's-complement addition


Since two's-complement addition has the exact same bit-level representation as
unsigned addition, we can characterize the operation $+^{t}_{w}$ as one of converting its
arguments to unsigned, performing unsigned addition, and then converting back
to two's complement:

$$
x +^{t}_{w} y = \mathit{U2T}_w(\mathit{T2U}_w(x) +^{u}_{w} \mathit{T2U}_w(y))
\tag{2.14}
$$

Figure 2.24 Relation between integer and two's-complement addition. When $x + y$ is
less than $-2^{w-1}$, there is a negative overflow. When it is greater than or equal
to $2^{w-1}$, there is a positive overflow.


By Equation 2.6, we can write $\mathit{T2U}_w(x)$ as $x_{w-1}2^w + x$ and $\mathit{T2U}_w(y)$ as
$y_{w-1}2^w + y$. Using the property that $+^{u}_{w}$ is simply addition modulo $2^w$, along with
the properties of modular addition, we then have

$$
\begin{aligned}
x +^{t}_{w} y &= \mathit{U2T}_w(\mathit{T2U}_w(x) +^{u}_{w} \mathit{T2U}_w(y)) \\
&= \mathit{U2T}_w[(x_{w-1}2^w + x + y_{w-1}2^w + y) \bmod 2^w] \\
&= \mathit{U2T}_w[(x + y) \bmod 2^w]
\end{aligned}
$$

The terms $x_{w-1}2^w$ and $y_{w-1}2^w$ drop out since they equal 0 modulo $2^w$.

To better understand this quantity, let us define z as the integer sum $z \doteq x + y$,
z' as $z' \doteq z \bmod 2^w$, and z'' as $z'' \doteq \mathit{U2T}_w(z')$. The value z'' is equal to $x +^{t}_{w} y$. We
can divide the analysis into four cases as illustrated in Figure 2.24:


1. $-2^w \le z < -2^{w-1}$. Then we will have $z' = z + 2^w$. This gives $0 \le z' < -2^{w-1} +
   2^w = 2^{w-1}$. Examining Equation 2.7, we see that z' is in the range such that
   $z'' = z'$. This is the case of negative overflow. We have added two negative
   numbers x and y (that's the only way we can have $z < -2^{w-1}$) and obtained
   a nonnegative result $z'' = x + y + 2^w$.

2. $-2^{w-1} \le z < 0$. Then we will again have $z' = z + 2^w$, giving $-2^{w-1} + 2^w =
   2^{w-1} \le z' < 2^w$. Examining Equation 2.7, we see that z' is in such a range that
   $z'' = z' - 2^w$, and therefore $z'' = z' - 2^w = z + 2^w - 2^w = z$. That is, our
   two's-complement sum z'' equals the integer sum $x + y$.

3. $0 \le z < 2^{w-1}$. Then we will have $z' = z$, giving $0 \le z' < 2^{w-1}$, and hence $z'' =
   z' = z$. Again, the two's-complement sum z'' equals the integer sum $x + y$.

4. $2^{w-1} \le z < 2^w$. We will again have $z' = z$, giving $2^{w-1} \le z' < 2^w$. But in this
   range we have $z'' = z' - 2^w$, giving $z'' = x + y - 2^w$. This is the case of positive
   overflow. We have added two positive numbers x and y (that's the only way
   we can have $z \ge 2^{w-1}$) and obtained a negative result $z'' = x + y - 2^w$.

x         y         x + y      $x +^{t}_{4} y$   Case
-8        -5        -13        3          1
[1000]    [1011]    [10011]    [0011]
-8        -8        -16        0          1
[1000]    [1000]    [10000]    [0000]
-8        5         -3         -3         2
[1000]    [0101]    [11101]    [1101]
2         5         7          7          3
[0010]    [0101]    [00111]    [0111]
5         5         10         -6         4
[0101]    [0101]    [01010]    [1010]

Figure 2.25 Two's-complement addition examples. The bit-level representation of
the 4-bit two's-complement sum can be obtained by performing binary addition of the
operands and truncating the result to 4 bits.


As illustrations of two's-complement addition, Figure 2.25 shows some examples
when w = 4. Each example is labeled by the case to which it corresponds in
the derivation of Equation 2.13. Note that $2^4 = 16$, and hence negative overflow
yields a result 16 more than the integer sum, and positive overflow yields a result 16
less. We include bit-level representations of the operands and the result. Observe
that the result can be obtained by performing binary addition of the operands and
truncating the result to 4 bits.
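The examples in Figure 2.25 can be reproduced with a small C sketch of Equation 2.14 for w = 4. The helper names t2u4, u2t4, and tadd4 are assumptions made for this illustration, and each operand is assumed to lie in the range -8 to 7.

```c
/* Hypothetical helpers for a 4-bit word, following Equation 2.14. */

/* T2U_4: map a value in [-8, 7] to its unsigned counterpart in [0, 15]. */
unsigned t2u4(int x) {
    return (x < 0) ? (unsigned)(x + 16) : (unsigned)x;
}

/* U2T_4: map a value in [0, 15] back to [-8, 7]. */
int u2t4(unsigned u) {
    return (u < 8) ? (int)u : (int)u - 16;
}

/* 4-bit two's-complement sum: convert, add modulo 16, convert back. */
int tadd4(int x, int y) {
    unsigned usum = (t2u4(x) + t2u4(y)) & 0xF;
    return u2t4(usum);
}
```

For example, tadd4(-8, -5) yields 3 (negative overflow) and tadd4(5, 5) yields -6 (positive overflow), matching the first and last rows of the figure.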
Figure 2.26 illustrates two's-complement addition for word size w = 4. The
operands range between -8 and 7. When $x + y < -8$, two's-complement addition
has a negative overflow, causing the sum to be incremented by 16. When $-8 \le
x + y < 8$, the addition yields $x + y$. When $x + y \ge 8$, the addition has a positive
overflow, causing the sum to be decremented by 16. Each of these three ranges
forms a sloping plane in the figure.

Equation 2.13 also lets us identify the cases where overflow has occurred:


PRINCIPLE: Detecting overflow in two's-complement addition


For x and y in the range $\mathit{TMin}_w \le x, y \le \mathit{TMax}_w$, let $s \doteq x +^{t}_{w} y$. Then the computation
of s has had positive overflow if and only if $x > 0$ and $y > 0$ but $s \le 0$. The
computation has had negative overflow if and only if $x < 0$ and $y < 0$ but $s \ge 0$.


Figure 2.25 shows several illustrations of this principle for w = 4. The first
entry shows a case of negative overflow, where two negative numbers sum to a
positive one. The final entry shows a case of positive overflow, where two positive
numbers sum to a negative one.

Figure 2.26 Two's-complement addition. With a 4-bit word size, addition can have a
negative overflow when $x + y < -8$ and a positive overflow when $x + y \ge 8$.


DERIVATION: Detecting overflow of two's-complement addition


Let us first do the analysis for positive overflow. If both $x > 0$ and $y > 0$ but $s \le 0$,
then clearly positive overflow has occurred. Conversely, positive overflow requires
(1) that $x > 0$ and $y > 0$ (otherwise, $x + y < \mathit{TMax}_w$) and (2) that $s \le 0$ (from
Equation 2.13). A similar set of arguments holds for negative overflow.
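As with the unsigned case, the sign-based tests can be compared against the exact overflow conditions for a small word size. The sketch below uses a hypothetical function name of our own and enumerates every pair of 4-bit two's-complement operands.

```c
/* Returns 1 if the sign-based tests match the exact overflow conditions
 * for every pair of 4-bit two's-complement operands. */
int tadd_detect_holds_4bit(void) {
    for (int x = -8; x <= 7; x++) {
        for (int y = -8; y <= 7; y++) {
            int z = x + y;                       /* exact integer sum */
            unsigned bits = (unsigned)z & 0xF;   /* low 4 bits of the sum */
            int s = (bits < 8) ? (int)bits : (int)bits - 16;
            int pos = (x > 0 && y > 0 && s <= 0);   /* positive overflow test */
            int neg = (x < 0 && y < 0 && s >= 0);   /* negative overflow test */
            if (pos != (z > 7) || neg != (z < -8))
                return 0;
        }
    }
    return 1;
}
```

The pair x = y = -8 is the reason the negative-overflow test must accept s = 0: the truncated sum is exactly zero.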





Practice Problem 2.29

Fill in the following table in the style of Figure 2.25. Give the integer values of
the 5-bit arguments, the values of both their integer and two's-complement sums,
the bit-level representation of the two's-complement sum, and the case from the
derivation of Equation 2.13.


x          y          x + y      $x +^{t}_{5} y$   Case
[10100]    [10001]    ______     ______     ______
[11000]    [100]      ______     ______     ______
[00010]    [00101]    ______     ______     ______
[01100]    [DIU]      ______     ______     ______

Write a function with the following prototype:

/* Determine whether arguments can be added without overflow */
int tadd_ok(int x, int y);

This function should return 1 if arguments x and y can be added without
causing overflow.





Your coworker gets impatient with your analysis of the overflow conditions for
two's-complement addition and presents you with the following implementation
of tadd_ok:


/* Determine whether arguments can be added without overflow */
/* WARNING: This code is buggy. */
int tadd_ok(int x, int y) {
    int sum = x+y;
    return (sum-x == y) && (sum-y == x);
}


You look at the code and laugh. Explain why.





You are assigned the task of writing code for a function tsub_ok, with arguments
x and y, that will return 1 if computing x-y does not cause overflow. Having just
written the code for Problem 2.30, you write the following:


/* Determine whether arguments can be subtracted without overflow */
/* WARNING: This code is buggy. */
int tsub_ok(int x, int y) {
    return tadd_ok(x, -y);
}


For what values of x and y will this function give incorrect results? Writing a
correct version of this function is left as an exercise (Problem 2.74).


2.3.3 Two's-Complement Negation 


We can see that every number x in the range $\mathit{TMin}_w \le x \le \mathit{TMax}_w$ has an additive
inverse under $+^{t}_{w}$, which we denote $-^{t}_{w} x$ as follows:

PRINCIPLE: Two's-complement negation


For x in the range $\mathit{TMin}_w \le x \le \mathit{TMax}_w$, its two's-complement negation $-^{t}_{w} x$ is
given by the formula

$$
-^{t}_{w} x =
\begin{cases}
\mathit{TMin}_w, & x = \mathit{TMin}_w \\
-x, & x > \mathit{TMin}_w
\end{cases}
\tag{2.15}
$$

That is, for w-bit two's-complement addition, $\mathit{TMin}_w$ is its own additive inverse,
while any other value x has $-x$ as its additive inverse.


DERIVATION: Two's-complement negation


Observe that $\mathit{TMin}_w + \mathit{TMin}_w = -2^{w-1} + -2^{w-1} = -2^w$. This would cause negative
overflow, and hence $\mathit{TMin}_w +^{t}_{w} \mathit{TMin}_w = -2^w + 2^w = 0$. For values of x such
that $x > \mathit{TMin}_w$, the value $-x$ can also be represented as a w-bit two's-complement
number, and their sum will be $-x + x = 0$.
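Equation 2.15 can be sketched in C for w = 8. The function name tneg8 is our own assumption, and the argument is assumed to lie in the range -128 to 127. Working on the unsigned bit pattern sidesteps signed overflow, which C leaves undefined.

```c
#include <stdint.h>

/* Hypothetical helper: 8-bit two's-complement negation (Equation 2.15).
 * Assumes -128 <= x <= 127. */
int tneg8(int x) {
    uint8_t bits = (uint8_t)x;             /* bit pattern of x, modulo 2^8 */
    uint8_t neg = (uint8_t)(0u - bits);    /* negate modulo 2^8 */
    return (neg < 128) ? neg : (int)neg - 256;   /* U2T_8 */
}
```

Here tneg8(-128) returns -128, showing the minimum value acting as its own additive inverse, while every other input x returns -x.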





We can represent a bit pattern of length w = 4 with a single hex digit. For a two's-complement
interpretation of these digits, fill in the following table to determine
the additive inverses of the digits shown:


        x                  $-^{t}_{4} x$
Hex     Decimal       Decimal     Hex





What do you observe about the bit patterns generated by two's-complement
and unsigned (Problem 2.28) negation?

Web Aside DATA:TNEG Bit-level representation of two's-complement negation

There are several clever ways to determine the two's-complement negation of a value represented
at the bit level. The following two techniques are both useful, such as when one encounters the value
0xfffffffa when debugging a program, and they lend insight into the nature of the two's-complement
representation.

One technique for performing two's-complement negation at the bit level is to complement the bits
and then increment the result. In C, we can state that for any integer value x, computing the expressions
-x and ~x + 1 will give identical results.

Here are some examples with a 4-bit word size:

x              ~x             incr(~x)
[0101]   5     [1010]  -6     [1011]  -5
[0111]   7     [1000]  -8     [1001]  -7
[1100]  -4     [0011]   3     [0100]   4
[0000]   0     [1111]  -1     [0000]   0
[1000]  -8     [0111]   7     [1000]  -8

For our earlier example, we know that the complement of 0xf is 0x0 and the complement of 0xa
is 0x5, and so 0xfffffffa is the two's-complement representation of -6.

A second way to perform two's-complement negation of a number x is based on splitting the bit
vector into two parts. Let k be the position of the rightmost 1, so the bit-level representation of x has the
form $[x_{w-1}, x_{w-2}, \ldots, x_{k+1}, 1, 0, \ldots, 0]$. (This is possible as long as $x \ne 0$.) The negation is then written
in binary form as $[\sim x_{w-1}, \sim x_{w-2}, \ldots, \sim x_{k+1}, 1, 0, \ldots, 0]$. That is, we complement each bit to the left of
bit position k.

We illustrate this idea with some 4-bit numbers. In each, the rightmost pattern 1, 0, ..., 0 is
left unchanged:

x              -x
[1100]  -4     [0100]   4
[1000]  -8     [1000]  -8
[0101]   5     [1011]  -5
[0111]   7     [1001]  -7
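Both techniques from the aside can be sketched in C on 32-bit patterns. The function names below are hypothetical, and the sketch operates on uint32_t bit patterns rather than on int values so that no signed overflow can arise.

```c
#include <stdint.h>

/* First technique: complement the bits, then increment. */
uint32_t neg_by_complement(uint32_t bits) {
    return ~bits + 1u;
}

/* Second technique: complement every bit to the left of the rightmost 1. */
uint32_t neg_by_split(uint32_t bits) {
    if (bits == 0)
        return 0;                                    /* no rightmost 1 exists */
    uint32_t lowest = bits & (~bits + 1u);           /* isolates the rightmost 1 */
    uint32_t left_mask = ~(lowest | (lowest - 1u));  /* bits above position k */
    return bits ^ left_mask;                         /* flip only those bits */
}
```

Applied to the pattern 0xfffffffa from the aside, both functions return 6, confirming that the pattern represents -6.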


2.3.4 Unsigned Multiplication 


Integers x and y in the range $0 \le x, y \le 2^w - 1$ can be represented as w-bit unsigned
numbers, but their product $x \cdot y$ can range between 0 and $(2^w - 1)^2 =
2^{2w} - 2^{w+1} + 1$. This could require as many as 2w bits to represent. Instead, unsigned
multiplication in C is defined to yield the w-bit value given by the low-order
w bits of the 2w-bit integer product. Let us denote this value as $x *^{u}_{w} y$.

Truncating an unsigned number to w bits is equivalent to computing its value
modulo $2^w$, giving the following:

PRINCIPLE: Unsigned multiplication
For x and y such that $0 \le x, y \le \mathit{UMax}_w$:

$$
x *^{u}_{w} y = (x \cdot y) \bmod 2^w
\tag{2.16}
$$


2.3.5 Two's-Complement Multiplication 


Integers x and y in the range $-2^{w-1} \le x, y \le 2^{w-1} - 1$ can be represented as w-bit
two's-complement numbers, but their product $x \cdot y$ can range between $-2^{w-1} \cdot
(2^{w-1} - 1) = -2^{2w-2} + 2^{w-1}$ and $-2^{w-1} \cdot -2^{w-1} = 2^{2w-2}$. This could require as
many as 2w bits to represent in two's-complement form. Instead, signed multiplication
in C generally is performed by truncating the 2w-bit product to w bits.
We denote this value as $x *^{t}_{w} y$. Truncating a two's-complement number to w bits
is equivalent to first computing its value modulo $2^w$ and then converting from
unsigned to two's complement, giving the following:


PRINCIPLE: Two's-complement multiplication
For x and y such that $\mathit{TMin}_w \le x, y \le \mathit{TMax}_w$:

$$
x *^{t}_{w} y = \mathit{U2T}_w((x \cdot y) \bmod 2^w)
\tag{2.17}
$$


We claim that the bit-level representation of the product operation is identical
for both unsigned and two's-complement multiplication, as stated by the following
principle:


PRINCIPLE: Bit-level equivalence of unsigned and two's-complement multiplication


Let $\vec{x}$ and $\vec{y}$ be bit vectors of length w. Define integers x and y as the values represented
by these bits in two's-complement form: $x = \mathit{B2T}_w(\vec{x})$ and $y = \mathit{B2T}_w(\vec{y})$.
Define nonnegative integers x' and y' as the values represented by these bits in
unsigned form: $x' = \mathit{B2U}_w(\vec{x})$ and $y' = \mathit{B2U}_w(\vec{y})$. Then

$$
\mathit{T2B}_w(x *^{t}_{w} y) = \mathit{U2B}_w(x' *^{u}_{w} y')
$$


As illustrations, Figure 2.27 shows the results of multiplying different 3-bit
numbers. For each pair of bit-level operands, we perform both unsigned and
two's-complement multiplication, yielding 6-bit products, and then truncate these
to 3 bits. The unsigned truncated product always equals $x \cdot y \bmod 8$. The bit-level
representations of both truncated products are identical for both unsigned
and two's-complement multiplication, even though the full 6-bit representations
differ.

Mode                 x           y           x · y            Truncated x · y
Unsigned              5 [101]     3 [011]    15 [001111]       7 [111]
Two's complement     -3 [101]     3 [011]    -9 [110111]      -1 [111]
Unsigned              4 [100]     7 [111]    28 [011100]       4 [100]
Two's complement     -4 [100]    -1 [111]     4 [000100]      -4 [100]
Unsigned              3 [011]     3 [011]     9 [001001]       1 [001]
Two's complement      3 [011]     3 [011]     9 [001001]       1 [001]


Figure 2.27 Three-bit unsigned and two's-complement multiplication examples.
Although the bit-level representations of the full products may differ, those of the
truncated products are identical.


DERIVATION: Bit-level equivalence of unsigned and two's-complement multiplication

From Equation 2.6, we have $x' = x + x_{w-1}2^w$ and $y' = y + y_{w-1}2^w$. Computing the
product of these values modulo $2^w$ gives the following:

$$
\begin{aligned}
(x' \cdot y') \bmod 2^w &= [(x + x_{w-1}2^w) \cdot (y + y_{w-1}2^w)] \bmod 2^w \\
&= [x \cdot y + (x_{w-1}y + y_{w-1}x)2^w + x_{w-1}y_{w-1}2^{2w}] \bmod 2^w \\
&= (x \cdot y) \bmod 2^w
\end{aligned}
\tag{2.18}
$$

The terms with weight $2^w$ and $2^{2w}$ drop out due to the modulus operator. By Equation 2.17,
we have $x *^{t}_{w} y = \mathit{U2T}_w((x \cdot y) \bmod 2^w)$. We can apply the operation
$\mathit{T2U}_w$ to both sides to get

$$
\mathit{T2U}_w(x *^{t}_{w} y) = \mathit{T2U}_w(\mathit{U2T}_w((x \cdot y) \bmod 2^w)) = (x \cdot y) \bmod 2^w
$$

Combining this result with Equations 2.16 and 2.18 shows that $\mathit{T2U}_w(x *^{t}_{w} y) =
(x' \cdot y') \bmod 2^w = x' *^{u}_{w} y'$. We can then apply $\mathit{U2B}_w$ to both sides to get

$$
\mathit{U2B}_w(\mathit{T2U}_w(x *^{t}_{w} y)) = \mathit{T2B}_w(x *^{t}_{w} y) = \mathit{U2B}_w(x' *^{u}_{w} y')
$$
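The bit-level equivalence can be confirmed exhaustively for 3-bit operands, as in Figure 2.27. The following C sketch uses hypothetical function names of our own; each argument is a 3-bit pattern in the range 0 to 7, interpreted either as unsigned or as two's complement.

```c
/* Low 3 bits of the product when both patterns are read as unsigned. */
unsigned umult3_bits(unsigned xbits, unsigned ybits) {
    return (xbits * ybits) & 0x7;
}

/* Low 3 bits of the product when both patterns are read as two's complement. */
unsigned tmult3_bits(unsigned xbits, unsigned ybits) {
    int x = (xbits < 4) ? (int)xbits : (int)xbits - 8;   /* B2T_3 */
    int y = (ybits < 4) ? (int)ybits : (int)ybits - 8;
    int product = x * y;               /* exact: magnitude is at most 16 */
    return (unsigned)product & 0x7;    /* keep the low 3 bits */
}

/* Returns 1 if the two agree for all 64 pairs of 3-bit patterns. */
int mult3_bits_agree(void) {
    for (unsigned xb = 0; xb < 8; xb++)
        for (unsigned yb = 0; yb < 8; yb++)
            if (umult3_bits(xb, yb) != tmult3_bits(xb, yb))
                return 0;
    return 1;
}
```

For the patterns [101] and [011], both functions return 7, the pattern [111], even though the full products 15 and -9 differ.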











Fill in the following table showing the results of multiplying different 3-bit numbers,
in the style of Figure 2.27:


Mode                 x               y               x · y           Truncated x · y
Unsigned             ____ [100]      ____ [101]      ____ ______     ____ ______
Two's complement     ____ [100]      ____ [101]      ____ ______     ____ ______
Unsigned             ____ [010]      ____ [111]      ____ ______     ____ ______
Two's complement     ____ [010]      ____ [111]      ____ ______     ____ ______
Unsigned             ____ [110]      ____ [110]      ____ ______     ____ ______
Two's complement     ____ [110]      ____ [110]      ____ ______     ____ ______




determine whether two arguments can be multiplied without causing overflow. 
Here is your solution: 


/* Determine whether arguments can be multiplied without overflow */
int tmult_ok(int x, int y) {
    int p = x*y;
    /* Either x is zero, or dividing p by x gives y */
    return !x || p/x == y;
}


You test this code for a number of values of x and y, and it seems to work 
properly. Your coworker challenges you, saying, “If I can't use subtraction to 
test whether addition has overflowed (see Problem 2.31), then how can you use 
division to test whether multiplication has overflowed?" 

Devise a mathematical justification of your approach, along the following
lines. First, argue that the case x = 0 is handled correctly. Otherwise, consider
w-bit numbers x ($x \ne 0$), y, p, and q, where p is the result of performing two's-complement
multiplication on x and y, and q is the result of dividing p by x.


1. Show that $x \cdot y$, the integer product of x and y, can be written in the form
   $x \cdot y = p + t2^w$, where $t \ne 0$ if and only if the computation of p overflows.

2. Show that p can be written in the form $p = x \cdot q + r$, where $|r| < |x|$.

3. Show that $q = y$ if and only if $r = t = 0$.





lem 2.35) that uses the 64-bit precision of data type int64_t, without using
division.











Practice Problem 2.37

You are given the task of patching the vulnerability in the XDR code shown in
the aside on page 100 for the case where both data types int and size_t are 32
bits. You decide to eliminate the possibility of the multiplication overflowing by
computing the number of bytes to allocate using data type uint64_t. You replace











100 Chapter 2 Representing and Manipulating Information 


Aside Security vulnerability in the XDR library


In 2002, it was discovered that code supplied by Sun Microsystems to implement the XDR library, a
widely used facility for sharing data structures between programs, had a security vulnerability arising
from the fact that multiplication can overflow without any notice being given to the program.

Code similar to that containing the vulnerability is shown below:


1   /* Illustration of code vulnerability similar to that found in
2    * Sun's XDR library.
3    */
4   void* copy_elements(void *ele_src[], int ele_cnt, size_t ele_size) {
5       /*
6        * Allocate buffer for ele_cnt objects, each of ele_size bytes
7        * and copy from locations designated by ele_src
8        */
9       void *result = malloc(ele_cnt * ele_size);
10      if (result == NULL)
11          /* malloc failed */
12          return NULL;
13      void *next = result;
14      int i;
15      for (i = 0; i < ele_cnt; i++) {
16          /* Copy object i to destination */
17          memcpy(next, ele_src[i], ele_size);
18          /* Move pointer to next memory region */
19          next += ele_size;
20      }
21      return result;
22  }

The function copy_elements is designed to copy ele_cnt data structures, each consisting of
ele_size bytes, into a buffer allocated by the function on line 9. The number of bytes required is
computed as ele_cnt * ele_size.

Imagine, however, that a malicious programmer calls this function with ele_cnt being 1,048,577
($2^{20} + 1$) and ele_size being 4,096 ($2^{12}$) with the program compiled for 32 bits. Then the multiplication
on line 9 will overflow, causing only 4,096 bytes to be allocated, rather than the 4,294,971,392 bytes
required to hold that much data. The loop starting at line 15 will attempt to copy all of those bytes,
overrunning the end of the allocated buffer, and therefore corrupting other data structures. This could
cause the program to crash or otherwise misbehave.

The Sun code was used by almost every operating system and in such widely used programs as
Internet Explorer and the Kerberos authentication system. The Computer Emergency Response Team
(CERT), an organization run by the Carnegie Mellon Software Engineering Institute to track security
vulnerabilities and breaches, issued advisory "CA-2002-25," and many companies rushed to patch their
code. Fortunately, there were no reported security breaches caused by this vulnerability.

A similar vulnerability existed in many implementations of the library function calloc. These
have since been patched. Unfortunately, many programmers call allocation functions, such as malloc,
using arithmetic expressions as arguments, without checking these expressions for overflow. Writing a
reliable version of calloc is left as an exercise (Problem 2.76).
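The arithmetic behind this scenario can be reproduced in isolation. In the sketch below, the function names are hypothetical, and uint32_t stands in for a 32-bit size_t; the first function mimics the wrapped product, while the second computes the exact byte count with 64-bit precision.

```c
#include <stdint.h>

/* Byte count as a 32-bit program would compute it: the product wraps
 * modulo 2^32. */
uint32_t alloc_size_32(uint32_t ele_cnt, uint32_t ele_size) {
    return ele_cnt * ele_size;
}

/* Exact byte count, computed with 64-bit precision. */
uint64_t alloc_size_exact(uint32_t ele_cnt, uint32_t ele_size) {
    return (uint64_t)ele_cnt * ele_size;
}
```

With ele_cnt = 1,048,577 and ele_size = 4,096, the first function returns 4,096 and the second returns 4,294,971,392, the two figures quoted in the aside.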










the original call to malloc (line 9) as follows:


uint64_t asize =
    ele_cnt * (uint64_t) ele_size;
void *result = malloc(asize);


Recall that the argument to malloc has type size_t.


A. Does your code provide any improvement over the original?


B. How would you change the code to eliminate the vulnerability?


2.3.6 Multiplying by Constants 


Historically, the integer multiply instruction on many machines was fairly slow,
requiring 10 or more clock cycles, whereas other integer operations, such as
addition, subtraction, bit-level operations, and shifting, required only 1 clock
cycle. Even on the Intel Core i7 Haswell we use as our reference machine, integer
multiply requires 3 clock cycles. As a consequence, one important optimization
used by compilers is to attempt to replace multiplications by constant factors with
combinations of shift and addition operations. We will first consider the case of
multiplying by a power of 2, and then we will generalize this to arbitrary constants.


PRINCIPLE: Multiplication by a power of 2


Let x be the unsigned integer represented by bit pattern $[x_{w-1}, x_{w-2}, \ldots, x_0]$.
Then for any $k \ge 0$, the w + k-bit unsigned representation of $x2^k$ is given by
$[x_{w-1}, x_{w-2}, \ldots, x_0, 0, \ldots, 0]$, where k zeros have been added to the right.


So, for example, 11 can be represented for w = 4 as [1011]. Shifting this left
by k = 2 yields the 6-bit vector [101100], which encodes the unsigned number
$11 \cdot 4 = 44$.


DERIVATION: Multiplication by a power of 2
This property can be derived using Equation 2.1:

$$
\begin{aligned}
\mathit{B2U}_{w+k}([x_{w-1}, x_{w-2}, \ldots, x_0, 0, \ldots, 0]) &= \sum_{i=0}^{w-1} x_i 2^{i+k} \\
&= \left[\sum_{i=0}^{w-1} x_i 2^i\right] \cdot 2^k \\
&= x2^k
\end{aligned}
$$

When shifting left by k for a fixed word size, the high-order k bits are discarded,
yielding

$$
[x_{w-k-1}, x_{w-k-2}, \ldots, x_0, 0, \ldots, 0]
$$

but this is also the case when performing multiplication on fixed-size words. We
can therefore see that shifting a value left is equivalent to performing unsigned
multiplication by a power of 2:


PRINCIPLE: Unsigned multiplication by a power of 2


For C variables x and k with unsigned values x and k, such that $0 \le k < w$, the C
expression x << k yields the value $x *^{u}_{w} 2^k$.

Since the bit-level operation of fixed-size two's-complement arithmetic is
equivalent to that for unsigned arithmetic, we can make a similar statement about
the relationship between left shifts and multiplication by a power of 2 for
two's-complement arithmetic:

PRINCIPLE: Two's-complement multiplication by a power of 2

For C variables x and k with two's-complement value x and unsigned value k, such
that $0 \le k < w$, the C expression x << k yields the value $x *^{t}_{w} 2^k$.


Note that multiplying by a power of 2 can cause overflow with either unsigned
or two's-complement arithmetic. Our result shows that even then we will get the
same effect by shifting. Returning to our earlier example, we shifted the 4-bit
pattern [1011] (numeric value 11) left by two positions to get [101100] (numeric
value 44). Truncating this to 4 bits gives [1100] (numeric value 12 = 44 mod 16).
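The effect of shifting within a fixed word size can be sketched in C by masking the result to w bits. The helper name shl_w is our own assumption, and the sketch assumes 1 <= w <= 31, 0 <= k < w, and x in the range 0 to $2^w - 1$.

```c
#include <stdint.h>

/* Hypothetical helper: shift a w-bit pattern left by k and keep only the
 * low w bits. Assumes 1 <= w <= 31 and 0 <= k < w. */
uint32_t shl_w(uint32_t x, unsigned k, unsigned w) {
    uint32_t mask = ((uint32_t)1 << w) - 1;   /* w low-order ones */
    return (x << k) & mask;                   /* discard the high-order bits */
}
```

Calling shl_w(11, 2, 6) gives 44, the untruncated product, while shl_w(11, 2, 4) gives 12, which is 44 mod 16.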

Given that integer multiplication is more costly than shifting and adding, many
C compilers try to remove many cases where an integer is being multiplied by a
constant with combinations of shifting, adding, and subtracting. For example, suppose
a program contains the expression x*14. Recognizing that $14 = 2^3 + 2^2 + 2^1$,
the compiler can rewrite the multiplication as (x<<3) + (x<<2) + (x<<1), replacing
one multiplication with three shifts and two additions. The two computations
will yield the same result, regardless of whether x is unsigned or two's complement,
and even if the multiplication would cause an overflow. Even better, the
compiler can also use the property $14 = 2^4 - 2^1$ to rewrite the multiplication as
(x<<4) - (x<<1), requiring only two shifts and a subtraction.
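Both rewritings of x*14 can be written out directly. The function names in this sketch are hypothetical; unsigned 32-bit operands are used so that overflow wraps modulo $2^{32}$ in a well-defined way.

```c
#include <stdint.h>

/* x * 14 using 14 = 2^3 + 2^2 + 2^1: three shifts and two additions. */
uint32_t times14_add(uint32_t x) {
    return (x << 3) + (x << 2) + (x << 1);
}

/* x * 14 using 14 = 2^4 - 2^1: two shifts and one subtraction. */
uint32_t times14_sub(uint32_t x) {
    return (x << 4) - (x << 1);
}
```

Both functions agree with x * 14u for every 32-bit input, including inputs for which the product overflows.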





of the form (a<<k) + b, where k is either 0, 1, 2, or 3, and b is either 0 or some
program value. The compiler often uses this instruction to perform multiplications
by constant factors. For example, we can compute 3*a as (a<<1) + a.

Considering cases where b is either 0 or equal to a, and all possible values of k,
what multiples of a can be computed with a single LEA instruction?


Generalizing from our example, consider the task of generating code for
the expression x * K, for some constant K. The compiler can express the binary
representation of K as an alternating sequence of zeros and ones:

[(0...0) (1...1) (0...0) ... (1...1)]


For example, 14 can be written as [(0...0) (111) (0)]. Consider a run of ones from
bit position n down to bit position m ($n \ge m$). (For the case of 14, we have n = 3
and m = 1.) We can compute the effect of these bits on the product using either of
two different forms:


Form A: (x<<n) + (x<<(n-1)) + ... + (x<<m)
Form B: (x<<(n+1)) - (x<<m)


By adding together the results for each run, we are able to compute x * K without
any multiplications. Of course, the trade-off between using combinations of
shifting, adding, and subtracting versus a single multiplication instruction depends
on the relative speeds of these instructions, and these can be highly machine dependent.
Most compilers only perform this optimization when a small number of
shifts, adds, and subtractions suffice.
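The two forms can be sketched for a single run of ones. The function names are our own, and the sketch deliberately assumes m <= n <= 30 so that every shift amount stays below the 32-bit word size.

```c
#include <stdint.h>

/* Form A: sum of x shifted by each bit position in the run n..m.
 * Assumes m <= n <= 30. */
uint32_t run_form_a(uint32_t x, unsigned n, unsigned m) {
    uint32_t total = 0;
    for (unsigned i = m; i <= n; i++)
        total += x << i;
    return total;
}

/* Form B: one shift past the top of the run, minus the bottom of the run.
 * Assumes m <= n <= 30, so that n + 1 is a valid shift amount. */
uint32_t run_form_b(uint32_t x, unsigned n, unsigned m) {
    return (x << (n + 1)) - (x << m);
}
```

For K = 14 (n = 3, m = 1) and x = 5, both forms return 70.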







How could we modify the expression for form B for the case where bit position n
is the most significant bit?


For each of the following values of K, find ways to express x * K using only the
specified number of operations, where we consider both additions and subtractions
to have comparable cost. You may need to use some tricks beyond the simple
form A and B rules we have considered so far.


K      Shifts    Add/Subs    Expression
6      2         1           ______
31     1         1           ______
-6     2         1           ______
55     2         2           ______


For a run of ones starting at bit position n down to bit position m ($n \ge m$), we saw
that we can generate two forms of code, A and B. How should the compiler decide
which form to use?





2.3.7 Dividing by Powers of 2


Integer division on most machines is even slower than integer multiplication,
requiring 30 or more clock cycles. Dividing by a power of 2 can also be performed
using shift operations, but we use a right shift rather than a left shift. The two
different right shifts, logical and arithmetic, serve this purpose for unsigned and
two's-complement numbers, respectively.

k     >> k (binary)        Decimal    12,340/2^k
0     0011000000110100     12,340     12,340.0
1     0001100000011010     6,170      6,170.0
4     0000001100000011     771        771.25
8     0000000000110000     48         48.203125


Figure 2.28 Dividing unsigned numbers by powers of 2. The examples illustrate
how performing a logical right shift by k has the same effect as dividing by $2^k$ and then
rounding toward zero.


Integer division always rounds toward zero. To define this precisely, let us
introduce some notation. For any real number a, define $\lfloor a \rfloor$ to be the unique
integer a' such that $a' \le a < a' + 1$. As examples, $\lfloor 3.14 \rfloor = 3$, $\lfloor -3.14 \rfloor = -4$, and
$\lfloor 3 \rfloor = 3$. Similarly, define $\lceil a \rceil$ to be the unique integer a' such that $a' - 1 < a \le a'$.
As examples, $\lceil 3.14 \rceil = 4$, $\lceil -3.14 \rceil = -3$, and $\lceil 3 \rceil = 3$. For $x \ge 0$ and $y > 0$, integer
division should yield $\lfloor x/y \rfloor$, while for $x < 0$ and $y > 0$, it should yield $\lceil x/y \rceil$. That
is, it should round down a positive result but round up a negative one.

The case for using shifts with unsigned arithmetic is straightforward, in part
because right shifting is guaranteed to be performed logically for unsigned values.

PRINCIPLE: Unsigned division by a power of 2


For C variables x and k with unsigned values x and k, such that $0 \le k < w$, the C
expression x >> k yields the value $\lfloor x/2^k \rfloor$.


As examples, Figure 2.28 shows the effects of performing logical right shifts
on a 16-bit representation of 12,340 to perform division by 1, 2, 16, and 256.
The zeros shifted in from the left are shown in italics. We also show the result
we would obtain if we did these divisions with real arithmetic. These examples
show that the result of shifting consistently rounds toward zero, as is the
convention for integer division.
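The rows of Figure 2.28 can be checked with a few lines of C. This is only a
sketch: the helper name udiv_pow2 is ours, and it assumes $0 \le k < 16$.

```c
#include <stdint.h>

/* Unsigned division by 2^k using a right shift.
   For unsigned (nonnegative) operands the shift is logical. Assumes 0 <= k < 16. */
uint16_t udiv_pow2(uint16_t x, unsigned k) {
    return (uint16_t)(x >> k);
}
```

For example, udiv_pow2(12340, 4) yields 771, the same value as the C expression
12340 / 16.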



DERIVATION: Unsigned division by a power of 2

Let $x$ be the unsigned integer represented by bit pattern
$[x_{w-1}, x_{w-2}, \ldots, x_0]$, and let $k$ be in the range $0 \le k < w$.
Let $x'$ be the unsigned number with $(w-k)$-bit representation
$[x_{w-1}, x_{w-2}, \ldots, x_k]$, and let $x''$ be the unsigned number with
$k$-bit representation $[x_{k-1}, \ldots, x_0]$. We can therefore see that
$x = 2^k x' + x''$, and that $0 \le x'' < 2^k$. It therefore follows that
$\lfloor x/2^k \rfloor = x'$.

Performing a logical right shift of bit vector $[x_{w-1}, x_{w-2}, \ldots, x_0]$
by $k$ yields the bit vector

$[0, \ldots, 0, x_{w-1}, x_{w-2}, \ldots, x_k]$

This bit vector has numeric value $x'$, which we have seen is the value that
would result by computing the expression x >> k.

k    >> k (binary)       Decimal    -12,340/2^k
0    1100111111001100    -12,340    -12,340.0
1    1110011111100110     -6,170     -6,170.0
4    1111110011111100       -772       -771.25
8    1111111111001111        -49        -48.203125

Figure 2.29 Applying arithmetic right shift. The examples illustrate that
arithmetic right shift is similar to division by a power of 2, except that it
rounds down rather than toward zero.


The case for dividing by a power of 2 with two's-complement arithmetic is
slightly more complex. First, the shifting should be performed using an
arithmetic right shift, to ensure that negative values remain negative. Let us
investigate what value such a right shift would produce.


PRINCIPLE: Two's-complement division by a power of 2, rounding down

Let C variables x and k have two's-complement value $x$ and unsigned value
$k$, respectively, such that $0 \le k < w$. The C expression x >> k, when the
shift is performed arithmetically, yields the value $\lfloor x/2^k \rfloor$.


For $x \ge 0$, variable x has 0 as the most significant bit, and so the effect
of an arithmetic shift is the same as for a logical right shift. Thus, an
arithmetic right shift by $k$ is the same as division by $2^k$ for a nonnegative
number. As an example of a negative number, Figure 2.29 shows the effect of
applying arithmetic right shift to a 16-bit representation of -12,340 for
different shift amounts. For the case when no rounding is required (k = 1), the
result will be $x/2^k$. When rounding is required, shifting causes the result to
be rounded downward. For example, shifting right by four has the effect of
rounding -771.25 down to -772. We will need to adjust our strategy to handle
division for negative values of x.
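The mismatch between shifting and C's own division operator can be seen
directly. The sketch below uses two helper names of our own. Note that the C
standard leaves the result of right-shifting a negative signed value
implementation-defined, so the first function assumes a compiler that shifts
arithmetically, as nearly all do.

```c
#include <stdint.h>

/* Arithmetic right shift of a 16-bit two's-complement value.
   Assumes the compiler shifts negative values arithmetically. */
int16_t shift_right_arith(int16_t x, unsigned k) {
    return (int16_t)(x >> k);
}

/* Division by 2^k with the / operator, which truncates toward zero. */
int16_t div_pow2_trunc(int16_t x, unsigned k) {
    return (int16_t)(x / (1 << k));
}
```

With x = -12340 and k = 4, the shift produces -772 while the division produces
-771, matching the third row of Figure 2.29.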


DERIVATION: Two's-complement division by a power of 2, rounding down

Let $x$ be the two's-complement integer represented by bit pattern
$[x_{w-1}, x_{w-2}, \ldots, x_0]$, and let $k$ be in the range $0 \le k < w$.
Let $x'$ be the two's-complement number represented by the $w - k$ bits
$[x_{w-1}, x_{w-2}, \ldots, x_k]$, and let $x''$ be the unsigned number
represented by the low-order $k$ bits $[x_{k-1}, \ldots, x_0]$. By a similar
analysis as the unsigned case, we have $x = 2^k x' + x''$ and
$0 \le x'' < 2^k$, giving $x' = \lfloor x/2^k \rfloor$. Furthermore, observe
that shifting bit vector $[x_{w-1}, x_{w-2}, \ldots, x_0]$ right arithmetically
by $k$ yields the bit vector

$[x_{w-1}, \ldots, x_{w-1}, x_{w-1}, x_{w-2}, \ldots, x_k]$

which is the sign extension from $w - k$ bits to $w$ bits of
$[x_{w-1}, x_{w-2}, \ldots, x_k]$. Thus, this shifted bit vector is the
two's-complement representation of $\lfloor x/2^k \rfloor$.








k    Bias   -12,340 + bias (binary)   >> k (binary)       Decimal    -12,340/2^k
0    0      1100111111001100          1100111111001100    -12,340    -12,340.0
1    1      1100111111001101          1110011111100110     -6,170     -6,170.0
4    15     1100111111011011          1111110011111101       -771       -771.25
8    255    1101000011001011          1111111111010000        -48        -48.203125

Figure 2.30 Dividing two's-complement numbers by powers of 2. By adding a bias
before the right shift, the result is rounded toward zero.


We can correct for the improper rounding that occurs when a negative number
is shifted right by "biasing" the value before shifting.

PRINCIPLE: Two's-complement division by a power of 2, rounding up

Let C variables x and k have two's-complement value $x$ and unsigned value $k$,
respectively, such that $0 \le k < w$. The C expression (x + (1 << k) - 1) >> k,
when the shift is performed arithmetically, yields the value
$\lceil x/2^k \rceil$.
Figure 2.30 demonstrates how adding the appropriate bias before performing
the arithmetic right shift causes the result to be correctly rounded. In the
third column, we show the result of adding the bias value to -12,340, with the
lower k bits (those that will be shifted off to the right) shown in italics. We
can see that the bits to the left of these may or may not be incremented. For
the case where no rounding is required (k = 1), adding the bias only affects
bits that are shifted off. For the cases where rounding is required, adding the
bias causes the upper bits to be incremented, so that the result will be rounded
toward zero.
The biasing technique exploits the property that
$\lceil x/y \rceil = \lfloor (x + y - 1)/y \rfloor$ for integers $x$ and $y$
such that $y > 0$. As examples, when $x = -30$ and $y = 4$, we have
$x + y - 1 = -27$ and $\lceil -30/4 \rceil = -7 = \lfloor -27/4 \rfloor$. When
$x = -32$ and $y = 4$, we have $x + y - 1 = -29$ and
$\lceil -32/4 \rceil = -8 = \lfloor -29/4 \rfloor$.


DERIVATION: Two's-complement division by a power of 2, rounding up

To see that $\lceil x/y \rceil = \lfloor (x + y - 1)/y \rfloor$, suppose that
$x = qy + r$, where $0 \le r < y$, giving
$(x + y - 1)/y = q + (r + y - 1)/y$, and so
$\lfloor (x + y - 1)/y \rfloor = q + \lfloor (r + y - 1)/y \rfloor$. The latter
term will equal 0 when $r = 0$ and 1 when $r > 0$. That is, by adding a bias of
$y - 1$ to $x$ and then rounding the division downward, we will get $q$ when $y$
divides $x$ and $q + 1$ otherwise.

Returning to the case where $y = 2^k$, the C expression x + (1 << k) - 1 yields
the value $x + 2^k - 1$. Shifting this right arithmetically by $k$ therefore
yields $\lceil x/2^k \rceil$.


These analyses show that for a two's-complement machine using arithmetic
right shifts, the C expression

    (x<0 ? x+(1<<k)-1 : x) >> k

will compute the value $x/2^k$.
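A minimal sketch of this expression wrapped in a function follows. The name
div_pow2 is ours, and the code assumes a 32-bit int, $0 \le k < 31$, and a
compiler that performs arithmetic right shifts on negative values (the C
standard leaves this implementation-defined).

```c
/* Divide x by 2^k, rounding toward zero, by biasing negative values
   before an arithmetic right shift. Assumes 0 <= k < 31. */
int div_pow2(int x, unsigned k) {
    int bias = (1 << k) - 1;            /* 2^k - 1: the bits that get shifted off */
    return (x < 0 ? x + bias : x) >> k;
}
```

For instance, div_pow2(-12340, 4) gives -771 and div_pow2(-12340, 8) gives -48,
the values in the Decimal column of Figure 2.30, and both agree with the results
of the C / operator.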
Write a function div16 that returns the value x/16 for integer argument x. Your
function should not use division, modulus, multiplication, any conditionals (if
or ?:), any comparison operators (e.g., <, >, or ==), or any loops. You may
assume that data type int is 32 bits long and uses a two's-complement
representation, and that right shifts are performed arithmetically.


We now see that division by a power of 2 can be implemented using logical or
arithmetic right shifts. This is precisely the reason the two types of right
shifts are available on most machines. Unfortunately, this approach does not
generalize to division by arbitrary constants. Unlike multiplication, we cannot
express division by arbitrary constants K in terms of division by powers of 2.


In the following code, we have omitted the definitions of constants M and N:

    #define M       /* Mystery number 1 */
    #define N       /* Mystery number 2 */

    int arith(int x, int y) {
        int result = 0;
        result = x*M + y/N; /* M and N are mystery numbers. */
        return result;
    }

We compiled this code for particular values of M and N. The compiler optimized
the multiplication and division using the methods we have discussed. The
following is a translation of the generated machine code back into C:

    /* Translation of assembly code for arith */
    int optarith(int x, int y) {
        int t = x;
        x <<= 5;
        x -= t;
        if (y < 0) y += 7;
        y >>= 3;  /* Arithmetic shift */
        return x+y;
    }

What are the values of M and N?


2.3.8 Final Thoughts on Integer Arithmetic 


As we have seen, the "integer" arithmetic performed by computers is really
a form of modular arithmetic. The finite word size used to represent numbers
limits the range of possible values, and the resulting operations can overflow.
We have also seen that the two's-complement representation provides a clever
way to represent both negative and positive values, while using the same
bit-level implementations as are used to perform unsigned arithmetic: operations
such as addition, subtraction, multiplication, and even division have either
identical or very similar bit-level behaviors whether the operands are in
unsigned or two's-complement form.

We have seen that some of the conventions in the C language can yield some
surprising results, and these can be sources of bugs that are hard to recognize
or understand. We have especially seen that the unsigned data type, while
conceptually straightforward, can lead to behaviors that even experienced
programmers do not expect. We have also seen that this data type can arise in
unexpected ways, for example, when writing integer constants and when invoking
library routines.





Assume data type int is 32 bits long and uses a two's-complement representation
for signed values. Right shifts are performed arithmetically for signed values
and logically for unsigned values. The variables are declared and initialized as
follows:

    int x = foo();   /* Arbitrary value */
    int y = bar();   /* Arbitrary value */

    unsigned ux = x;
    unsigned uy = y;

For each of the following C expressions, either (1) argue that it is true
(evaluates to 1) for all values of x and y, or (2) give values of x and y for
which it is false (evaluates to 0):

A. (x > 0) || (x-1 < 0)
B. (x & 7) != 7 || (x<<29 < 0)
C. (x * x) >= 0
D. x < 0 || -x <= 0
E. x > 0 || -x >= 0
F. x+y == uy+ux
G. x*~y + uy*ux == -x

2.4 Floating Point 


A floating-point representation encodes rational numbers of the form
$V = x \times 2^y$. It is useful for performing computations involving very
large numbers ($|V| \gg 0$), numbers very close to 0 ($|V| \ll 1$), and more
generally as an approximation to real arithmetic.

The Institute of Electrical and Electronics Engineers (IEEE, pronounced
"eye-triple-ee") is a ...

Up until the 1980s, every computer manufacturer devised its own conventions 
for how floating-point numbers were represented and the details of the operations 
performed on them. In addition, they often did not worry too much about the 
accuracy of the operations, viewing speed and ease of implementation as being 
more critical than numerical precision. 

All of this changed around 1985 with the advent of IEEE Standard 754, a
carefully crafted standard for representing floating-point numbers and the
operations performed on them. This effort started in 1976 under Intel's
sponsorship with the design of the 8087, a chip that provided floating-point
support for the 8086 processor. Intel hired William Kahan, a professor at the
University of California, Berkeley, as a consultant to help design a
floating-point standard for its future processors. They allowed Kahan to join
forces with a committee generating an industry-wide standard under the auspices
of the Institute of Electrical and Electronics Engineers (IEEE). The committee
ultimately adopted a standard close to the one Kahan had devised for Intel.
Nowadays, virtually all computers support what has become known as IEEE floating
point. This has greatly improved the portability of scientific application
programs across different machines.

In this section, we will see how numbers are represented in the IEEE floating- 
point format. We will also explore issues of rounding, when a number cannot be 
represented exactly in the format and hence must be adjusted upward or down- 
ward. We will then explore the mathematical properties of addition, multiplica- 
tion, and relational operators. Many programmers consider floating point to be 
at best uninteresting and at worst arcane and incomprehensible. We will see that 
since the IEEE format is based on a small and consistent set of principles, it is 
really quite elegant and understandable. 


2.4.1 Fractional Binary Numbers 


A first step in understanding floating-point numbers is to consider binary numbers 
having fractional values. Let us first examine the more familiar decimal notation. 
Decimal notation uses a representation of the form 


$d_m d_{m-1} \cdots d_1 d_0 . d_{-1} d_{-2} \cdots d_{-n}$

where each decimal digit $d_i$ ranges between 0 and 9. This notation represents
a value $d$ defined as

$$d = \sum_{i=-n}^{m} 10^i \times d_i$$

The weighting of the digits is defined relative to the decimal point symbol
('.'), meaning that digits to the left are weighted by nonnegative powers of 10,
giving integral values, while digits to the right are weighted by negative
powers of 10, giving fractional values. For example, $12.34_{10}$ represents the
number $1 \times 10^1 + 2 \times 10^0 + 3 \times 10^{-1} + 4 \times 10^{-2} =
12\frac{34}{100}$.

Figure 2.31 Fractional binary representation. Digits to the left of the binary
point have weights of the form $2^i$, while those to the right have weights of
the form $1/2^i$.

By analogy, consider a notation of the form 


$b_m b_{m-1} \cdots b_1 b_0 . b_{-1} b_{-2} \cdots b_{-n+1} b_{-n}$

where each binary digit, or bit, $b_i$ ranges between 0 and 1, as is illustrated
in Figure 2.31. This notation represents a number $b$ defined as

$$b = \sum_{i=-n}^{m} 2^i \times b_i \qquad (2.19)$$

The symbol '.' now becomes a binary point, with bits on the left being weighted
by nonnegative powers of 2, and those on the right being weighted by negative
powers of 2. For example, $101.11_2$ represents the number
$1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2}
= 4 + 0 + 1 + \frac{1}{2} + \frac{1}{4} = 5\frac{3}{4}$.

One can readily see from Equation 2.19 that shifting the binary point one
position to the left has the effect of dividing the number by 2. For example,
while $101.11_2$ represents the number $5\frac{3}{4}$, $10.111_2$ represents the
number $2 + 0 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} = 2\frac{7}{8}$.
Similarly, shifting the binary point one position to the right has the effect of
multiplying the number by 2. For example, $1011.1_2$ represents the number
$8 + 0 + 2 + 1 + \frac{1}{2} = 11\frac{1}{2}$.

Note that numbers of the form $0.11 \cdots 1_2$ represent numbers just below 1.
For example, $0.111111_2$ represents $\frac{63}{64}$. We will use the shorthand
notation $1.0 - \epsilon$ to represent such values.
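Equation 2.19 translates almost directly into code. The following sketch uses a
function name of our own choosing and assumes the input string contains only the
characters '0' and '1' and at most one '.'.

```c
#include <stddef.h>

/* Evaluate a string such as "101.11" as a fractional binary number:
   bits left of the point carry weights 2^0, 2^1, ...; bits to the right
   carry weights 2^-1, 2^-2, ... */
double frac_binary_value(const char *s) {
    double value = 0.0;
    double weight = 0.5;      /* weight of the first bit right of the point */
    int seen_point = 0;

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '.') {
            seen_point = 1;
        } else if (!seen_point) {
            value = 2.0 * value + (s[i] - '0');   /* integer part */
        } else {
            value += weight * (s[i] - '0');       /* fractional part */
            weight /= 2.0;
        }
    }
    return value;
}
```

Applied to the examples in the text, "101.11" evaluates to 5.75, "10.111" to
2.875, and "1011.1" to 11.5, showing the halving and doubling that result from
moving the binary point.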

Assuming we consider only finite-length encodings, decimal notation cannot
represent numbers such as $\frac{1}{3}$ and $\frac{5}{7}$ exactly. Similarly,
fractional binary notation can only represent numbers that can be written
$x \times 2^y$. Other values can only be approximated. For example, the number
$\frac{1}{5}$ can be represented exactly as the fractional decimal number 0.20.
As a fractional binary number, however, we cannot represent it exactly and
instead must approximate it with increasing accuracy by lengthening the binary
representation:

Representation    Value     Decimal
0.0_2             0/2       0.0_10
0.01_2            1/4       0.25_10
0.010_2           2/8       0.25_10
0.0011_2          3/16      0.1875_10
0.00110_2         6/32      0.1875_10
0.001101_2        13/64     0.203125_10
0.0011010_2       26/128    0.203125_10
0.00110011_2      51/256    0.19921875_10
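One way to reproduce the numerators in this table is to round $2^n/5$ to the
nearest integer for each fraction length $n$. The sketch below does this with
integer arithmetic only. The function names are ours, and we assume
$1 \le n \le 28$ so that every intermediate value fits in a 32-bit int.

```c
/* Numerator of the n-bit binary fraction closest to 1/5, i.e., 2^n / 5
   rounded to the nearest integer. Assumes 1 <= n <= 28. */
int nearest_fifth_numerator(int n) {
    int scale = 1 << n;               /* denominator 2^n */
    return (2 * scale + 5) / 10;      /* round(scale / 5) without floating point */
}

/* The approximation itself, as a double. */
double nearest_fifth(int n) {
    return (double)nearest_fifth_numerator(n) / (double)(1 << n);
}
```

For n = 1 through 8 the numerators are 0, 1, 2, 3, 6, 13, 26, and 51, matching
the Value column, and nearest_fifth(8) is 0.19921875.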





Fill in the missing information in the following table:

Fractional value    Binary representation    Decimal representation
1/8                 0.001                    0.125
3/4                 ________                 ________
5/16                ________                 ________
________            10.1011                  ________
________            1.001                    ________
________            ________                 5.875
________            ________                 3.1875








The imprecision of floating-point arithmetic can have disastrous effects. On
February 25, 1991, during the first Gulf War, an American Patriot Missile
battery in Dharan, Saudi Arabia, failed to intercept an incoming Iraqi Scud
missile. The Scud struck an American Army barracks and killed 28 soldiers. The
US General Accounting Office (GAO) conducted a detailed analysis of the failure
[76] and determined that the underlying cause was an imprecision in a numeric
calculation. In this exercise, you will reproduce part of the GAO's analysis.

The Patriot system contains an internal clock, implemented as a counter
that is incremented every 0.1 seconds. To determine the time in seconds, the
program would multiply the value of this counter by a 24-bit quantity that was
a fractional binary approximation to $\frac{1}{10}$. In particular, the binary
representation of $\frac{1}{10}$ is the nonterminating sequence
$0.000110011[0011]\cdots_2$, where the portion in brackets is repeated
indefinitely. The program approximated 0.1, as a value x, by considering just
the first 23 bits of the sequence to the right of the binary point:
x = 0.00011001100110011001100. (See Problem 2.51 for a discussion of how they
could have approximated 0.1 more precisely.)


A. What is the binary representation of 0.1 - x?

B. What is the approximate decimal value of 0.1 - x?

C. The clock starts at 0 when the system is first powered up and keeps counting
   up from there. In this case, the system had been running for around 100
   hours. What was the difference between the actual time and the time computed
   by the software?

D. The system predicts where an incoming missile will appear based on its
   velocity and the time of the last radar detection. Given that a Scud travels
   at around 2,000 meters per second, how far off was its prediction?


Normally, a slight error in the absolute time reported by a clock reading would
not affect a tracking computation. Instead, it should depend on the relative
time between two successive readings. The problem was that the Patriot software
had been upgraded to use a more accurate function for reading time, but not all
of the function calls had been replaced by the new code. As a result, the
tracking software used the accurate time for one reading and the inaccurate time
for the other [103].


2.4.2 IEEE Floating-Point Representation 


Positional notation such as considered in the previous section would not be
efficient for representing very large numbers. For example, the representation
of $5 \times 2^{100}$ would consist of the bit pattern 101 followed by 100
zeros. Instead, we would like to represent numbers in a form $x \times 2^y$ by
giving the values of $x$ and $y$.

The IEEE floating-point standard represents a number in a form
$V = (-1)^s \times M \times 2^E$:

* The sign $s$ determines whether the number is negative ($s = 1$) or positive
  ($s = 0$), where the interpretation of the sign bit for numeric value 0 is
  handled as a special case.

* The significand $M$ is a fractional binary number that ranges either between
  1 and $2 - \epsilon$ or between 0 and $1 - \epsilon$.

* The exponent $E$ weights the value by a (possibly negative) power of 2.

Single precision:   bit 31: s    bits 30-23: exp    bits 22-0: frac
Double precision:   bit 63: s    bits 62-52: exp    bits 51-32: frac (51:32)
                                                    bits 31-0:  frac (31:0)

Figure 2.32 Standard floating-point formats. Floating-point numbers are
represented by three fields. For the two most common formats, these are packed
in 32-bit (single-precision) or 64-bit (double-precision) words.





The bit representation of a floating-point number is divided into three fields to 
encode these values: 


* The single sign bit s directly encodes the sign $s$.
* The $k$-bit exponent field exp $= e_{k-1} \cdots e_1 e_0$ encodes the
  exponent $E$.
* The $n$-bit fraction field frac $= f_{n-1} \cdots f_1 f_0$ encodes the
  significand $M$, but the value encoded also depends on whether or not the
  exponent field equals 0.

Figure 2.32 shows the packing of these three fields into words for the two
most common formats. In the single-precision floating-point format (a float
in C), fields s, exp, and frac are 1, k = 8, and n = 23 bits each, yielding a
32-bit representation. In the double-precision floating-point format (a double
in C), fields s, exp, and frac are 1, k = 11, and n = 52 bits each, yielding a
64-bit representation.
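To make the single-precision packing concrete, the following sketch pulls the
three fields out of a float. The type and function names are ours. The code
assumes that float uses the 32-bit IEEE format, and it copies the bytes with
memcpy rather than casting pointers so that the reinterpretation is well defined
in C.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;   /* 1 bit */
    uint32_t exp;    /* k = 8 bits */
    uint32_t frac;   /* n = 23 bits */
} float_fields;

/* Split a float into its sign, exponent, and fraction fields. */
float_fields split_float(float f) {
    uint32_t bits;
    float_fields out;
    memcpy(&bits, &f, sizeof bits);    /* view the same 32 bits as an integer */
    out.sign = bits >> 31;
    out.exp  = (bits >> 23) & 0xFF;
    out.frac = bits & 0x7FFFFF;
    return out;
}
```

For example, split_float(1.0f) gives sign 0, exp 127, and frac 0, while
split_float(-2.0f) gives sign 1, exp 128, and frac 0.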

The value encoded by a given bit representation can be divided into three 
different cases (the latter having two variants), depending on the value of exp. 
These are illustrated in Figure 2.33 for the single-precision format. 


Case 1: Normalized Values 


This is the most common case. It occurs when the bit pattern of exp is neither
all zeros (numeric value 0) nor all ones (numeric value 255 for single
precision, 2047 for double). In this case, the exponent field is interpreted as
representing a signed integer in biased form. That is, the exponent value is
$E = e - Bias$, where $e$ is the unsigned number having bit representation
$e_{k-1} \cdots e_1 e_0$ and $Bias$ is a bias value equal to $2^{k-1} - 1$ (127
for single precision and 1023 for double). This yields exponent ranges from -126
to +127 for single precision and -1022 to +1023 for double precision.

The fraction field frac is interpreted as representing the fractional value $f$,
where $0 \le f < 1$, having binary representation $0.f_{n-1} \cdots f_1 f_0$,
that is, with the binary point to the left of the most significant bit. The
significand is defined to be $M = 1 + f$. This is sometimes called an implied
leading 1 representation, because we can view $M$ to be the number with binary
representation $1.f_{n-1} f_{n-2} \cdots f_0$. This representation is a trick
for getting an additional bit of precision for free, since we can always adjust
the exponent $E$ so that significand $M$ is in the range $1 \le M < 2$ (assuming
there is no overflow). We therefore do not need to explicitly represent the
leading bit, since it always equals 1.

Figure 2.33 Categories of single-precision floating-point values. The value of
the exponent determines whether the number is (1) normalized, (2) denormalized,
or (3) a special value.


Case 2: Denormalized Values


When the exponent field is all zeros, the represented number is in denormalized
form. In this case, the exponent value is $E = 1 - Bias$, and the significand
value is $M = f$, that is, the value of the fraction field without an implied
leading 1.

Denormalized numbers serve two purposes. First, they provide a way to
represent numeric value 0, since with a normalized number we must always have
$M \ge 1$, and hence we cannot represent 0. In fact, the floating-point
representation of +0.0 has a bit pattern of all zeros: the sign bit is 0, the
exponent field is all zeros (indicating a denormalized value), and the fraction
field is all zeros, giving $M = f = 0$. Curiously, when the sign bit is 1, but
the other fields are all zeros, we get the value -0.0. With IEEE floating-point
format, the values -0.0 and +0.0 are considered different in some ways and the
same in others.

A second function of denormalized numbers is to represent numbers that are
very close to 0.0. They provide a property known as gradual underflow in which
possible numeric values are spaced evenly near 0.0.
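Both purposes can be observed from C. In this sketch the two helper names are
ours, and we assume that float is the IEEE single-precision format and that the
program is not compiled in a mode that flushes denormalized values to zero.

```c
#include <stdint.h>
#include <string.h>

/* Build a float from a 32-bit pattern. */
float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Return the 32-bit pattern of a float. */
uint32_t bits_from_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

The patterns 0x00000000 and 0x80000000 are +0.0 and -0.0: they compare equal
with ==, yet their bit patterns differ. The patterns 1, 2, and 3 are the three
smallest positive denormalized values, and the difference between consecutive
ones is always the value of pattern 1, which is the even spacing described
above.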


Case 3: Special Values 


A final category of values occurs when the exponent field is all ones. When the
fraction field is all zeros, the resulting values represent infinity, either
$+\infty$ when $s = 0$ or $-\infty$ when $s = 1$. Infinity can represent results
that overflow, as when we multiply two very large numbers, or when we divide by
zero. When the fraction field is nonzero, the resulting value is called a NaN,
short for "not a number." Such values are returned as the result of an operation
where the result cannot be given as a real number or as infinity, as when
computing $\sqrt{-1}$ or $\infty - \infty$. They can also be useful in some
applications for representing uninitialized data.
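The three cases can be summarized as a classifier over single-precision bit
patterns. This is a sketch only: the function names and the integer return codes
are our own convention, and float is assumed to be IEEE single precision.

```c
#include <stdint.h>
#include <string.h>

/* Build a float from a 32-bit pattern. */
float from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Classify a pattern by its exponent and fraction fields.
   Returns 0 = denormalized (including zero), 1 = normalized,
           2 = infinity, 3 = NaN. */
int classify_bits(uint32_t bits) {
    uint32_t exp  = (bits >> 23) & 0xFF;
    uint32_t frac = bits & 0x7FFFFF;
    if (exp == 0)    return 0;          /* exponent field all zeros */
    if (exp != 0xFF) return 1;          /* neither all zeros nor all ones */
    return frac == 0 ? 2 : 3;           /* exponent field all ones */
}
```

For instance, 0x7F800000 and 0xFF800000 classify as infinity, and 0x7FC00000
classifies as a NaN. A NaN also has the unusual property that it does not
compare equal to itself.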


2.4.3 Example Numbers 


Figure 2.34 shows the set of values that can be represented in a hypothetical
6-bit format having k = 3 exponent bits and n = 2 fraction bits. The bias is
$2^{3-1} - 1 = 3$. Part (a) of the figure shows all representable values (other
than NaN). The two infinities are at the extreme ends. The normalized numbers
with maximum magnitude are $\pm 14$. The denormalized numbers are clustered
around 0. These can be seen more clearly in part (b) of the figure, where we
show just the numbers between -1.0 and +1.0. The two zeros are special cases of
denormalized numbers. Observe that the representable numbers are not uniformly
distributed: they are denser nearer the origin.

[Figure 2.34: number lines marking denormalized, normalized, and infinite
values. (a) Complete range, from $-\infty$ to $+\infty$. (b) Values between
-1.0 and +1.0, including -0 and +0.]

Figure 2.34 Representable values for 6-bit floating-point format. There are
k = 3 exponent bits and n = 2 fraction bits. The bias is 3.

Figure 2.35 shows some examples for a hypothetical 8-bit floating-point format
having k = 4 exponent bits and n = 3 fraction bits. The bias is
$2^{4-1} - 1 = 7$. The figure is divided into three regions representing the
three classes of numbers. The different columns show how the exponent field
encodes the exponent $E$, while the fraction field encodes the significand $M$,
and together they form the represented value $V = 2^E \times M$. Closest to 0
are the denormalized numbers, starting with 0 itself. Denormalized numbers in
this format have $E = 1 - 7 = -6$, giving a weight $2^E = \frac{1}{64}$. The
fractions $f$ and significands $M$ range over the values
$0, \frac{1}{8}, \ldots, \frac{7}{8}$, giving numbers $V$ in the range 0 to
$\frac{1}{64} \times \frac{7}{8} = \frac{7}{512}$.

                                            Exponent          Fraction      Value
Description            Bit representation   e    E    2^E     f     M       2^E × M   V       Decimal
Zero                   0 0000 000           0    -6   1/64    0/8   0/8     0/512     0       0.0
Smallest positive      0 0000 001           0    -6   1/64    1/8   1/8     1/512     1/512   0.001953
                       0 0000 010           0    -6   1/64    2/8   2/8     2/512     1/256   0.003906
                       0 0000 011           0    -6   1/64    3/8   3/8     3/512     3/512   0.005859
                       ...
Largest denormalized   0 0000 111           0    -6   1/64    7/8   7/8     7/512     7/512   0.013672
Smallest normalized    0 0001 000           1    -6   1/64    0/8   8/8     8/512     1/64    0.015625
                       0 0001 001           1    -6   1/64    1/8   9/8     9/512     9/512   0.017578
                       ...
                       0 0110 110           6    -1   1/2     6/8   14/8    14/16     7/8     0.875
                       0 0110 111           6    -1   1/2     7/8   15/8    15/16     15/16   0.9375
One                    0 0111 000           7    0    1       0/8   8/8     8/8       1       1.0
                       0 0111 001           7    0    1       1/8   9/8     9/8       9/8     1.125
                       0 0111 010           7    0    1       2/8   10/8    10/8      5/4     1.25
                       ...
                       0 1110 110           14   7    128     6/8   14/8    1792/8    224     224.0
Largest normalized     0 1110 111           14   7    128     7/8   15/8    1920/8    240     240.0
Infinity               0 1111 000           -    -    -       -     -       -         ∞       -

Figure 2.35 Example nonnegative values for 8-bit floating-point format. There
are k = 4 exponent bits and n = 3 fraction bits. The bias is 7.

The smallest normalized numbers in this format also have $E = 1 - 7 = -6$,
and the fractions also range over the values
$0, \frac{1}{8}, \ldots, \frac{7}{8}$. However, the significands then range from
$1 + 0 = 1$ to $1 + \frac{7}{8} = \frac{15}{8}$, giving numbers $V$ in the range
$\frac{8}{512} = \frac{1}{64}$ to $\frac{15}{512}$.

Observe the smooth transition between the largest denormalized number
$\frac{7}{512}$ and the smallest normalized number $\frac{8}{512}$. This
smoothness is due to our definition of $E$ for denormalized values. By making it
$1 - Bias$ rather than $-Bias$, we compensate for the fact that the significand
of a denormalized number does not have an implied leading 1.

As we increase the exponent, we get successively larger normalized values,
passing through 1.0 and then to the largest normalized number. This number has
exponent $E = 7$, giving a weight $2^E = 128$. The fraction equals
$\frac{7}{8}$, giving a significand $M = \frac{15}{8}$. Thus, the numeric value
is $V = 240$. Going beyond this overflows to $+\infty$.
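The rules for the three cases are compact enough to write a decoder for this
hypothetical 8-bit format. The sketch below uses a function name of our own and
takes the eight bits in the low-order byte of an unsigned argument. The header
math.h is included only for the INFINITY and NAN macros.

```c
#include <math.h>   /* INFINITY and NAN macros only */

/* Decode a pattern of the 8-bit format: 1 sign bit, k = 4 exponent bits,
   n = 3 fraction bits, bias 7. */
double decode_fp8(unsigned bits) {
    unsigned sign = (bits >> 7) & 0x1;
    unsigned exp  = (bits >> 3) & 0xF;
    unsigned frac = bits & 0x7;
    const int bias = 7;
    double m;       /* significand M */
    int e;          /* exponent E */

    if (exp == 0xF) {                   /* special values */
        if (frac != 0) return NAN;
        return sign ? -INFINITY : INFINITY;
    }
    if (exp == 0) {                     /* denormalized: E = 1 - Bias, M = f */
        e = 1 - bias;
        m = frac / 8.0;
    } else {                            /* normalized: E = e - Bias, M = 1 + f */
        e = (int)exp - bias;
        m = 1.0 + frac / 8.0;
    }

    double weight = 1.0;                /* 2^E by repeated doubling or halving */
    for (int i = 0; i < e; i++) weight *= 2.0;
    for (int i = 0; i > e; i--) weight /= 2.0;

    return sign ? -(weight * m) : weight * m;
}
```

Decoding 0x07 (0 0000 111) gives 7/512 and decoding 0x08 (0 0001 000) gives
8/512, the smooth transition noted above. Decoding 0x77 (0 1110 111) gives 240.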

One interesting property of this representation is that if we interpret the bit 
representations of the values in Figure 2.35 as unsigned integers, they occur in 
ascending order, as do the values they represent as floating-point numbers. This is 
no accident—the IEEE format was designed so that floating-point numbers could 
be sorted using an integer sorting routine. A minor difficulty occurs when dealing 
with negative numbers, since they have a leading 1 and occur in descending order, 
but this can be overcome without requiring floating-point operations to perform 
comparisons (see Problem 2.84). 
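The ordering property is easy to try out for single precision. In the sketch
below, the function names are ours, float is assumed to be IEEE single
precision, and both arguments must be nonnegative (with a sign bit of 0) and not
NaN. Handling negative values is the subject of the problem cited above and is
not attempted here.

```c
#include <stdint.h>
#include <string.h>

/* Return the 32-bit pattern of a float. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Compare two nonnegative, non-NaN floats using only an unsigned integer
   comparison of their bit patterns. Returns 1 if a < b, else 0. */
int less_nonneg(float a, float b) {
    return float_bits(a) < float_bits(b);
}
```

The comparison works across categories as well: a denormalized value has a
smaller bit pattern than any normalized value, which in turn has a smaller bit
pattern than positive infinity.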





Consider a 5-bit floating-point representation based on the IEEE floating-point
format, with one sign bit, two exponent bits (k = 2), and two fraction bits
(n = 2). The exponent bias is $2^{2-1} - 1 = 1$.

The table that follows enumerates the entire nonnegative range for this 5-bit
floating-point representation. Fill in the blank table entries using the
following directions:

e: The value represented by considering the exponent field to be an unsigned
   integer
E: The value of the exponent after biasing
2^E: The numeric weight of the exponent
f: The value of the fraction
M: The value of the significand
2^E × M: The (unreduced) fractional value of the number
V: The reduced fractional value of the number
Decimal: The decimal representation of the number

Express the values of 2^E, f, M, 2^E × M, and V either as integers (when
possible) or as fractions of the form x/y, where y is a power of 2. You need not
fill in entries marked -.

Bits      e     E     2^E   f     M     2^E × M   V     Decimal
0 00 00   ___   ___   ___   ___   ___   ___       ___   ___
0 00 01   ___   ___   ___   ___   ___   ___       ___   ___
0 00 10   ___   ___   ___   ___   ___   ___       ___   ___
0 00 11   ___   ___   ___   ___   ___   ___       ___   ___
0 01 00   ___   ___   ___   ___   ___   ___       ___   ___
0 01 01   1     0     1     1/4   5/4   5/4       5/4   1.25
0 01 10   ___   ___   ___   ___   ___   ___       ___   ___
0 01 11   ___   ___   ___   ___   ___   ___       ___   ___
0 10 00   ___   ___   ___   ___   ___   ___       ___   ___
0 10 01   ___   ___   ___   ___   ___   ___       ___   ___
0 10 10   ___   ___   ___   ___   ___   ___       ___   ___
0 10 11   ___   ___   ___   ___   ___   ___       ___   ___
0 11 00   -     -     -     -     -     -         ___   -
0 11 01   -     -     -     -     -     -         ___   -
0 11 10   -     -     -     -     -     -         ___   -
0 11 11   -     -     -     -     -     -         ___   -
Figure 2. 36 shows the represéntations ‘and hunieric values of some important 
single- ‘and’ dóüble- -precisión fioating-point numbers. As with the 8-bit forniat 
shown in Figure 2.35, we can see some general properties for a floating-point 
representation with a k-bit exponent and an A-bit fraction: `} 
* The value +0.0 always has a bit representation of al] zeros. 
, * The smallest positive denormalized value has a bit representation consisting of 
a 1 in the least significant bit position and otherwise all zeros. It has a fraction 
(and significand) value M = f —2^" and an exponent value E = —2*-1 4- 2. 
The numeric value is therefore V = 277-2742. 
* The largest denormalized value has abit representation consisting of an 
exponent field of all zeros and a fraction field of all ones, It has a'fraction 
(and significand) value M = f =1—2™ (which we have written 1 — €) and 
an exponent value E — —2*-1 4- 2. The numeric value is therefore V — ü — 
2-") x 2-232. which is just slightly smaller: than the smallest normalized 
value. : 
* The smallest positive normalized value has a bit representation with a 1 in
  the least significant bit of the exponent field and otherwise all zeros. It has a
  significand value $M = 1$ and an exponent value $E = -2^{k-1} + 2$. The numeric
  value is therefore $V = 2^{-2^{k-1}+2}$.
* The value 1.0 has a bit representation with all but the most significant bit of
  the exponent field equal to 1 and all other bits equal to 0. Its significand value
  is $M = 1$ and its exponent value is $E = 0$.
* The largest normalized value has a bit representation with a sign bit of 0, the
  least significant bit of the exponent equal to 0, and all other bits equal to 1. It
  has a fraction value of $f = 1 - 2^{-n}$, giving a significand $M = 2 - 2^{-n}$ (which we
  have written $2 - \epsilon$). It has an exponent value $E = 2^{k-1} - 1$, giving a numeric
  value $V = (2 - 2^{-n}) \times 2^{2^{k-1}-1} = (1 - 2^{-n-1}) \times 2^{2^{k-1}}$.

                                            Single precision                   Double precision
Description             exp       frac      Value              Decimal         Value                Decimal
Zero                    00...00   0...00    0                  0.0             0                    0.0
Smallest denormalized   00...00   0...01    2^-23 × 2^-126     1.4 × 10^-45    2^-52 × 2^-1022      4.9 × 10^-324
Largest denormalized    00...00   1...11    (1 - ε) × 2^-126   1.2 × 10^-38    (1 - ε) × 2^-1022    2.2 × 10^-308
Smallest normalized     00...01   0...00    1 × 2^-126         1.2 × 10^-38    1 × 2^-1022          2.2 × 10^-308
One                     01...11   0...00    1 × 2^0            1.0             1 × 2^0              1.0
Largest normalized      11...10   1...11    (2 - ε) × 2^127    3.4 × 10^38     (2 - ε) × 2^1023     1.8 × 10^308

Figure 2.36 Examples of nonnegative floating-point numbers.
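
The bit-level descriptions of these special values can be checked directly in C.
The following is a minimal sketch, assuming that float is the IEEE
single-precision format (k = 8, n = 23). The helper names float_from_bits and
make_float are our own and are not part of any library.

```c
#include <assert.h>
#include <float.h>   /* FLT_MIN and FLT_MAX, useful for comparison */
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float.
   Assumption: float is IEEE single precision (k = 8, n = 23). */
float float_from_bits(uint32_t bits) {
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);   /* copy the bytes without converting */
    return f;
}

/* Assemble a nonnegative float from its exponent and fraction fields. */
float make_float(uint32_t exp_field, uint32_t frac_field) {
    return float_from_bits(((exp_field & 0xFF) << 23) | (frac_field & 0x7FFFFF));
}
```

Under these assumptions, make_float(0, 1) should be the smallest positive
denormalized value, make_float(1, 0) the smallest normalized value (FLT_MIN),
make_float(127, 0) the value 1.0, and make_float(254, 0x7FFFFF) the largest
normalized value (FLT_MAX).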


One useful exercise for understanding floating-point representations is to con-
vert sample integer values into floating-point form. For example, we saw in Figure
2.15 that 12,345 has binary representation [11000000111001]. We create a normal-
ized representation of this by shifting 13 positions to the right of a binary point,
giving $12{,}345 = 1.1000000111001_2 \times 2^{13}$. To encode this in IEEE single-precision
format, we construct the fraction field by dropping the leading 1 and adding 10
zeros to the end, giving binary representation [10000001110010000000000]. To
construct the exponent field, we add bias 127 to 13, giving 140, which has bi-
nary representation [10001100]. We combine this with a sign bit of 0 to get the
floating-point representation in binary of [01000110010000001110010000000000].
Recall from Section 2.1.3 that we observed the following correlation in the bit-
level representations of the integer value 12345 (0x3039) and the single-precision
floating-point value 12345.0 (0x4640E400):


    0   0   0   0   3   0   3   9
    00000000000000000011000000111001
                       *************
              4   6   4   0   E   4   0   0
              01000110010000001110010000000000


We can now see that the region of correlation corresponds to the low-order
bits of the integer, stopping just before the most significant bit equal to 1 (this bit 
forms the implied leading 1), matching the high-order bits in the fraction part of 
the floating-point representation. 
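
The correlation can also be observed programmatically. The sketch below assumes
an IEEE single-precision float and integer values below $2^{24}$, for which the
conversion to float is exact. The helper names float_bits, matching_bits, and
fraction_matches are hypothetical ones chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the bit pattern of a float (assumed IEEE single precision). */
uint32_t float_bits(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Count the bits below the most significant 1 of x. These are the
   integer bits that reappear at the top of the fraction field. */
int matching_bits(uint32_t x) {
    int n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Return 1 when the low n bits of x equal the high n bits of the fraction
   field of (float) x, where n = matching_bits(x). Intended for 0 < x < 2^24. */
int fraction_matches(uint32_t x) {
    int n = matching_bits(x);
    uint32_t frac = float_bits((float) x) & 0x7FFFFF;   /* 23-bit fraction */
    uint32_t low = x & ((1u << n) - 1);                  /* bits below leading 1 */
    return (frac >> (23 - n)) == low;
}
```

For the value 12,345 used above, float_bits(12345.0f) should equal 0x4640E400
and matching_bits(12345) should equal 13, the width of the starred region.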





As mentioned in Problem 2.6, the integer 3,510,593 has hexadecimal represen-
tation 0x00359141, while the single-precision floating-point number 3,510,593.0
has hexadecimal representation 0x4A564504. Derive this floating-point represen-
tation and explain the correlation between the bits of the integer and floating-point
representations.


A. For a floating-point format with an n-bit fraction, give a formula for the
   smallest positive integer that cannot be represented exactly (because it
   would require an (n + 1)-bit fraction to be exact). Assume the exponent
   field size k is large enough that the range of representable exponents does
   not provide a limitation for this problem.

B. What is the numeric value of this integer for single-precision format (n =
   23)?





2.4.4 Rounding 


Floating-point arithmetic can only approximate real arithmetic, since the repre-
sentation has limited range and precision. Thus, for a value x, we generally want
a systematic method of finding the “closest” matching value x′ that can be rep-
resented in the desired floating-point format. This is the task of the rounding
operation. One key problem is to define the direction to round a value that is
halfway between two possibilities. For example, if I have $1.50 and want to round
it to the nearest dollar, should the result be $1 or $2? An alternative approach is
to maintain a lower and an upper bound on the actual number. For example, we
could determine representable values x⁻ and x⁺ such that the value x is guaran-
teed to lie between them: x⁻ ≤ x ≤ x⁺. The IEEE floating-point format defines
four different rounding modes. The default method finds a closest match, while
the other three can be used for computing upper and lower bounds.

Figure 2.37 illustrates the four rounding modes applied to the problem of
rounding a monetary amount to the nearest whole dollar. Round-to-even (also
called round-to-nearest) is the default mode. It attempts to find a closest match.
Thus, it rounds $1.40 to $1 and $1.60 to $2, since these are the closest whole dollar
values. The only design decision is to determine the effect of rounding values
that are halfway between two possible results. Round-to-even mode adopts the
convention that it rounds the number either upward or downward such that the
least significant digit of the result is even. Thus, it rounds both $1.50 and $2.50
to $2.

The other three modes produce guaranteed bounds on the actual value. These
can be useful in some numerical applications. Round-toward-zero mode rounds
positive numbers downward and negative numbers upward, giving a value x̂ such
that |x̂| ≤ |x|. Round-down mode rounds both positive and negative numbers
downward, giving a value x⁻ such that x⁻ ≤ x. Round-up mode rounds both
positive and negative numbers upward, giving a value x⁺ such that x ≤ x⁺.

Mode                 $1.40   $1.60   $1.50   $2.50   $-1.50
Round-to-even        $1      $2      $2      $2      $-2
Round-toward-zero    $1      $1      $1      $2      $-1
Round-down           $1      $1      $1      $2      $-2
Round-up             $2      $2      $2      $3      $-1

Figure 2.37 Illustration of rounding modes for dollar rounding. The first rounds to
a nearest value, while the other three bound the result above or below.
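
The four modes can be modeled on whole numbers of cents, where the arithmetic is
exact. This is only an illustrative sketch of the rounding rules, not how
floating-point hardware implements them. The type round_mode and the function
round_dollars are names invented for this example.

```c
#include <assert.h>

/* The four rounding modes, named as in Figure 2.37. */
typedef enum {
    ROUND_TO_EVEN,
    ROUND_TOWARD_ZERO,
    ROUND_DOWN,
    ROUND_UP
} round_mode;

/* Round an amount given in cents to whole dollars under the given mode. */
long round_dollars(long cents, round_mode mode) {
    long floor_d = cents / 100;   /* C division truncates toward zero */
    long rem = cents % 100;       /* remainder has the sign of cents */
    if (rem < 0) {                /* turn truncation into a true floor */
        floor_d -= 1;
        rem += 100;
    }
    /* Now cents == floor_d * 100 + rem with 0 <= rem < 100. */
    switch (mode) {
    case ROUND_DOWN:
        return floor_d;
    case ROUND_UP:
        return rem ? floor_d + 1 : floor_d;
    case ROUND_TOWARD_ZERO:
        return (cents < 0 && rem) ? floor_d + 1 : floor_d;
    case ROUND_TO_EVEN:
    default:
        if (rem > 50) return floor_d + 1;
        if (rem < 50) return floor_d;
        /* Exactly halfway: pick the neighbor whose last digit is even. */
        return (floor_d % 2 == 0) ? floor_d : floor_d + 1;
    }
}
```

For example, round_dollars(250, ROUND_TO_EVEN) yields 2 and
round_dollars(-150, ROUND_UP) yields -1, matching the corresponding entries of
Figure 2.37.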

Round-to-even at first seems like it has a rather arbitrary goal: why is there
any reason to prefer even numbers? Why not consistently round values halfway
between two representable values upward? The problem with such a convention
is that one can easily imagine scenarios in which rounding a set of data values
would then introduce a statistical bias into the computation of an average of the
values. The average of a set of numbers that we rounded by this means would
be slightly higher than the average of the numbers themselves. Conversely, if we
always rounded numbers halfway between downward, the average of a set of
rounded numbers would be slightly lower than the average of the numbers them-
selves. Rounding toward even numbers avoids this statistical bias in most real-life
situations. It will round upward about 50% of the time and round downward about
50% of the time.

Round-to-even rounding can be applied even when we are not rounding to 
a whole number. We simply consider whether the least significant digit is even
or odd. For example, suppose we want to round decimal numbers to the nearest 
hundredth. We would round 1.2349999 to 1.23 and 1.2350001 to 1.24, regardless 
of rounding mode, since they are not halfway between 1.23 and 1.24. On the other 
hand, we would round both 1.2350000 and 1.2450000 to 1.24, since 4 is even. 

Similarly, round-to-even rounding can be applied to binary fractional num-
bers. We consider least significant bit value 0 to be even and 1 to be odd. In
general, the rounding mode is only significant when we have a bit pattern of the
form XX...X.YY...Y100..., where X and Y denote arbitrary bit values with
the rightmost Y being the position to which we wish to round. Only bit patterns
of this form denote values that are halfway between two possible results. As ex-
amples, consider the problem of rounding values to the nearest quarter (i.e., 2 bits
to the right of the binary point). We would round $10.00011_2$ ($2\frac{3}{32}$) down to
$10.00_2$ (2), and $10.00110_2$ ($2\frac{3}{16}$) up to $10.01_2$ ($2\frac{1}{4}$), because these values
are not halfway between two possible values. We would round $10.11100_2$ ($2\frac{7}{8}$) up
to $11.00_2$ (3) and $10.10100_2$ ($2\frac{5}{8}$) down to $10.10_2$ ($2\frac{1}{2}$), since these values
are halfway between two possible results, and we prefer to have the least
significant bit equal to zero.
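
The same rule can be expressed with bit operations on a fixed-point value. The
following sketch treats an unsigned integer as a binary fraction whose k low-order
bits lie beyond the position being rounded to. The function name
round_to_even_shift is our own, and the code assumes 1 <= k < w for a w-bit
unsigned.

```c
#include <assert.h>

/* Drop the k low-order bits of x, rounding to nearest with ties to even. */
unsigned round_to_even_shift(unsigned x, int k) {
    unsigned half = 1u << (k - 1);         /* discarded pattern 100...0 */
    unsigned rem = x & ((1u << k) - 1);    /* the bits being discarded */
    unsigned q = x >> k;                   /* truncated result */
    if (rem > half || (rem == half && (q & 1)))
        q++;                               /* round up; a tie goes to even */
    return q;
}
```

Writing the four examples above in units of 1/32 and dropping k = 3 bits gives
results in units of 1/4: the inputs 67, 70, 92, and 84 (that is, $10.00011_2$,
$10.00110_2$, $10.11100_2$, and $10.10100_2$) round to 8, 9, 12, and 10 (that is,
$10.00_2$, $10.01_2$, $11.00_2$, and $10.10_2$).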





Show how the following binary fractional values would be rounded to the nearest
half (1 bit to the right of the binary point), according to the round-to-even rule.
In each case, show the numeric values, both before and after rounding.

A. $10.010_2$
B. $10.011_2$
C. $10.110_2$
D. $11.001_2$


We saw in Problem 2.46 that the Patriot missile software approximated 0.1 as
$x = 0.00011001100110011001100_2$. Suppose instead that they had used IEEE round-
to-even mode to determine an approximation $x'$ to 0.1 with 23 bits to the right of
the binary point.

A. What is the binary representation of $x'$?

B. What is the approximate decimal value of $x' - 0.1$?

C. How far off would the computed clock have been after 100 hours of opera-
   tion?

D. How far off would the program's prediction of the position of the Scud
   missile have been?





Consider the following two 7-bit floating-point representations based on the IEEE
floating-point format. Neither has a sign bit; they can only represent nonnegative
numbers.

1. Format A
   * There are k = 3 exponent bits. The exponent bias is 3.
   * There are n = 4 fraction bits.
2. Format B
   * There are k = 4 exponent bits. The exponent bias is 7.
   * There are n = 3 fraction bits.

Below, you are given some bit patterns in format A, and your task is to convert
them to the closest value in format B. If necessary, you should apply the round-to-
even rounding rule. In addition, give the values of numbers given by the format A
and format B bit patterns. Give these as whole numbers (e.g., 17) or as fractions
(e.g., 17/64).





        Format A                Format B
Bits        Value        Bits        Value
011 0000    1            0111 000    1
101 1110    ____         ____        ____
010 1001    ____         ____        ____
110 1111    ____         ____        ____
000 0001    ____         ____        ____


2.4.5 Floating-Point Operations 


The IEEE standard specifies a simple rule for determining the result of an arith-
metic operation such as addition or multiplication. Viewing floating-point values x
and y as real numbers, and some operation $\odot$ defined over real numbers, the com-
putation should yield $\mathrm{Round}(x \odot y)$, the result of applying rounding to the exact
result of the real operation. In practice, there are clever tricks floating-point unit
designers use to avoid performing this exact computation, since the computation
need only be sufficiently precise to guarantee a correctly rounded result. When
one of the arguments is a special value, such as $-0$, $\infty$, or NaN, the standard spec-
ifies conventions that attempt to be reasonable. For example, $1/{-0}$ is defined to
yield $-\infty$, while $1/{+0}$ is defined to yield $+\infty$.

One strength of the IEEE standard’s method of specifying the behavior of 
floating-point operations is that it is independent of any particular hardware or 
software realization. Thus, we can examine its abstract mathematical properties 
without considering how it is actually implemented. 

We saw earlier that integer addition, both unsigned and two's complement,
forms an abelian group. Addition over real numbers also forms an abelian group,
but we must consider what effect rounding has on these properties. Let us define
$x +^{f} y$ to be $\mathrm{Round}(x + y)$. This operation is defined for all values of x and y,
although it may yield infinity even when both x and y are real numbers due to
overflow. The operation is commutative, with $x +^{f} y = y +^{f} x$ for all values of x and
y. On the other hand, the operation is not associative. For example, with single-
precision floating point the expression (3.14+1e10)-1e10 evaluates to 0.0: the
value 3.14 is lost due to rounding. On the other hand, the expression 3.14+(1e10-
1e10) evaluates to 3.14. As with an abelian group, most values have inverses
under floating-point addition, that is, $x +^{f} -x = 0$. The exceptions are infinities
(since $+\infty - \infty = \mathrm{NaN}$), and NaNs, since $\mathrm{NaN} +^{f} x = \mathrm{NaN}$ for any x.
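
The associativity example can be reproduced with a short C sketch. It assumes
that float is IEEE single precision. The function names add_left and add_right are
ours, and the volatile temporaries are one way (an assumption about typical
compilers, not a requirement of the standard) to make sure each intermediate sum
is actually rounded to float rather than folded or kept in a wider register.

```c
#include <assert.h>

/* Evaluate (a + b) + c, rounding each partial sum to single precision. */
float add_left(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* Evaluate a + (b + c), rounding each partial sum to single precision. */
float add_right(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

With a = 3.14, b = 1e10, and c = -1e10, add_left loses the 3.14 when it is added
to 1e10 and returns 0.0, while add_right first cancels the two large values and
returns 3.14.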

The lack of associativity in floating-point addition is the most important group
property that is lacking. It has important implications for scientific programmers
and compiler writers. For example, suppose a compiler is given the following code
fragment:

    x = a + b + c;
    y = b + c + d;

The compiler might be tempted to save one floating-point addition by generating
the following code:

    t = b + c;
    x = a + t;
    y = t + d;


However, this computation might yield a different value for x than would the 
original, since it uses a different association of the addition operations. In most 
applications, the difference would be so small as to be inconsequential. Unfor- 
tunately, compilers have no way of knowing what trade-offs the user is willing to 
make between efficiency and faithfulness to the exact behavior of the original pro- 
gram. As a result, they tend to be very conservative, avoiding any optimizations 
that could have even the slightest effect on functionality. 
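
To see how the two versions of the fragment can disagree, consider the following
sketch. It assumes IEEE single-precision float. The struct sums and the functions
sums_original and sums_shared are hypothetical names for this illustration, and
the volatile temporaries force each sum to be rounded to float.

```c
#include <assert.h>

typedef struct { float x, y; } sums;

/* The fragment as written: a + b + c parses as (a + b) + c. */
sums sums_original(float a, float b, float c, float d) {
    volatile float ab = a + b;
    volatile float bc = b + c;
    volatile float x = ab + c;
    volatile float y = bc + d;
    sums s = { x, y };
    return s;
}

/* The tempting rewrite: share t = b + c between x and y. */
sums sums_shared(float a, float b, float c, float d) {
    volatile float t = b + c;
    volatile float x = a + t;      /* now computes a + (b + c) */
    volatile float y = t + d;
    sums s = { x, y };
    return s;
}
```

For a = 1e10, b = -1e10, c = 3.14, and d = 1.0, the original code yields
x = 3.14, while the rewritten code yields x = 0.0, because 3.14 is lost when it is
first added to -1e10. The value of y is the same in both versions.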
On the other hand, floating-point addition satisfies the following monotonicity
property: if $a \ge b$, then $x +^{f} a \ge x +^{f} b$ for any values of a, b, and x other than
NaN. This property of real (and integer) addition is not obeyed by unsigned or
two's-complement addition.

Floating-point multiplication also obeys many of the properties one normally
associates with multiplication. Let us define $x *^{f} y$ to be $\mathrm{Round}(x \times y)$. This oper-
ation is closed under multiplication (although possibly yielding infinity or NaN),
it is commutative, and it has 1.0 as a multiplicative identity. On the other hand,
it is not associative, due to the possibility of overflow or the loss of precision
due to rounding. For example, with single-precision floating point, the expression
(1e20*1e20)*1e-20 evaluates to $+\infty$, while 1e20*(1e20*1e-20) evaluates to
1e20. In addition, floating-point multiplication does not distribute over addition.
For example, with single-precision floating point, the expression 1e20*(1e20-
1e20) evaluates to 0.0, while 1e20*1e20-1e20*1e20 evaluates to NaN.
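
These multiplication examples can be checked with a sketch along the following
lines, again assuming IEEE single-precision float. All function names are ours.
The tests for infinity and NaN are written without <math.h>: a value greater than
FLT_MAX must be $+\infty$, and NaN is the only value that compares unequal to itself.

```c
#include <assert.h>
#include <float.h>

/* (a * b) * c, with each product rounded to single precision. */
float mul_left(float a, float b, float c) {
    volatile float t = a * b;
    volatile float r = t * c;
    return r;
}

/* a * (b * c) */
float mul_right(float a, float b, float c) {
    volatile float t = b * c;
    volatile float r = a * t;
    return r;
}

/* a * (b - c): subtract first, then multiply. */
float mul_of_diff(float a, float b, float c) {
    volatile float d = b - c;
    volatile float r = a * d;
    return r;
}

/* a * b - a * c: the distributed form. */
float diff_of_muls(float a, float b, float c) {
    volatile float p = a * b;
    volatile float q = a * c;
    volatile float r = p - q;
    return r;
}

int is_pos_inf(float x) { return x > FLT_MAX; }

int is_nan(float x) {
    volatile float v = x;
    return v != v;
}
```

With the operands from the text, mul_left(1e20f, 1e20f, 1e-20f) overflows to
$+\infty$ while mul_right stays finite, and diff_of_muls(1e20f, 1e20f, 1e20f)
produces NaN (the difference of two infinities) while mul_of_diff produces 0.0.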

On the other hand, floating-point multiplication satisfies the following mono-
tonicity properties for any values of a, b, and c other than NaN:

$$a \ge b \ \text{and}\ c \ge 0 \;\Rightarrow\; a *^{f} c \ge b *^{f} c$$

$$a \ge b \ \text{and}\ c \le 0 \;\Rightarrow\; a *^{f} c \le b *^{f} c$$

In addition, we are also guaranteed that $a *^{f} a \ge 0$, as long as $a \ne \mathrm{NaN}$. As we
saw earlier, none of these monotonicity properties hold for unsigned or two's-
complement multiplication.

This lack of associativity and distributivity is of serious concern to scientific
programmers and to compiler writers. Even such a seemingly simple task as writing
code to determine whether two lines intersect in three-dimensional space can be
a major challenge.


2.4.6 Floating Point in C 


All versions of C provide two different floating-point data types: float and dou-
ble. On machines that support IEEE floating point, these data types correspond
to single- and double-precision floating point. In addition, the machines use the
round-to-even rounding mode. Unfortunately, since the C standards do not re-
quire the machine to use IEEE floating point, there are no standard methods to
change the rounding mode or to get special values such as $-0$, $+\infty$, $-\infty$, or NaN.
Most systems provide a combination of include (.h) files and procedure libraries
to provide access to these features, but the details vary from one system to an-
other. For example, the GNU compiler gcc defines program constants INFINITY
(for $+\infty$) and NAN (for NaN) when the following sequence occurs in the program
file:

    #define _GNU_SOURCE 1
    #include <math.h>
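
As a brief sketch of how these constants might be used, the helpers below (whose
names are our own) divide by zero at run time and test for NaN. The code assumes
a gcc-style environment in which the sequence above makes INFINITY and NAN
available, and IEEE arithmetic in which floating-point division by zero does not
trap.

```c
#define _GNU_SOURCE 1
#include <assert.h>
#include <math.h>

/* Compute 1.0 / x at run time. The volatile copy keeps the compiler from
   evaluating (or warning about) a division by a constant zero. */
double reciprocal(double x) {
    volatile double d = x;
    return 1.0 / d;
}

/* NaN is the only value that compares unequal to itself. */
int is_nan(double x) {
    volatile double v = x;
    return v != v;
}
```

Under these assumptions, reciprocal(0.0) compares equal to INFINITY, and
is_nan(NAN) returns 1 while is_nan(INFINITY) returns 0.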














Fill in the following macro definitions to generate the double-precision values $+\infty$,
$-\infty$, and $-0$:

    #define POS_INFINITY
    #define NEG_INFINITY
    #define NEG_ZERO

You cannot use any include files (such as math.h), but you can make use of the
fact that the largest finite number that can be represented with double precision
is around $1.8 \times 10^{308}$.


When casting values between int, float, and double formats, the program
changes the numeric values and the bit representations as follows (assuming data
type int is 32 bits):

* From int to float, the number cannot overflow, but it may be rounded.

* From int or float to double, the exact numeric value can be preserved be-
  cause double has both greater range (i.e., the range of representable values),
  as well as greater precision (i.e., the number of significant bits).

* From double to float, the value can overflow to $+\infty$ or $-\infty$, since the range
  is smaller. Otherwise, it may be rounded, because the precision is smaller.

* From float or double to int, the value will be rounded toward zero. For
  example, 1.999 will be converted to 1, while -1.999 will be converted to
  -1. Furthermore, the value may overflow. The C standards do not specify
  a fixed result for this case. Intel-compatible microprocessors designate the
  bit pattern [10...00] ($TMin_w$ for word size w) as an integer indefinite value.
  Any conversion from floating point to integer that cannot assign a reasonable
  integer approximation yields this value. Thus, the expression (int) +1e10
  yields -2147483648, generating a negative value from a positive one.
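
The conversion rules above can be illustrated with a few small helpers. This is a
sketch under the same assumption as the text (a 32-bit int) plus IEEE float and
double. The names int_to_float, fits_in_int, and to_int_or are hypothetical. Because
the C standards leave an out-of-range conversion unspecified, the sketch checks
the range before casting rather than relying on the integer indefinite value.

```c
#include <assert.h>
#include <limits.h>

/* int -> float cannot overflow but may round: float keeps 24 significant bits. */
float int_to_float(int x) {
    volatile float f = (float) x;
    return f;
}

/* Return 1 when (int) d is well defined, that is, when d truncated toward
   zero lies in [INT_MIN, INT_MAX]. NaN fails both comparisons.
   With a 32-bit int, INT_MIN - 1 and INT_MAX + 1 are exact as doubles. */
int fits_in_int(double d) {
    return d > (double) INT_MIN - 1.0 && d < (double) INT_MAX + 1.0;
}

/* Convert d to int, rounding toward zero, or return fallback when the
   value is out of range or NaN. */
int to_int_or(double d, int fallback) {
    return fits_in_int(d) ? (int) d : fallback;
}
```

For example, to_int_or(1.999, 0) yields 1 and to_int_or(-1.999, 0) yields -1,
int_to_float(123456789) yields 123456792.0 (the nearest representable float), and
fits_in_int(1e10) yields 0, so the problematic cast is never performed.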





Assume variables x, f, and d are of type int, float, and double, respectively.
Their values are arbitrary, except that neither f nor d equals $+\infty$, $-\infty$, or NaN.
For each of the following C expressions, either argue that it will always be true
(i.e., evaluate to 1) or give a value for the variables such that it is not true (i.e.,
evaluates to 0).

A. x == (int)(double) x
B. x == (int)(float) x
C. d == (double)(float) d
D. f == (float)(double) f
E. f == -(-f)
F. 1.0/2 == 1/2.0
G. d*d >= 0.0
H. (f+d)-f == d



2.5 Summary 


Computers encode information as bits, generally organized as sequences of bytes.
Different encodings are used for representing integers, real numbers, and charac-
ter strings. Different models of computers use different conventions for encoding
numbers and for ordering the bytes within multi-byte data.

The C language is designed to accommodate a wide range of different imple-
mentations in terms of word sizes and numeric encodings. Machines with 64-bit
word sizes have become increasingly common, replacing the 32-bit machines that
dominated the market for around 30 years. Because 64-bit machines can also run
programs compiled for 32-bit machines, we have focused on the distinction be-
tween 32- and 64-bit programs, rather than machines. The advantage of 64-bit pro-
grams is that they can go beyond the 4 GB address limitation of 32-bit programs.

Most machines encode signed numbers using a two's-complement representa-
tion and encode floating-point numbers using IEEE Standard 754. Understanding
these encodings at the bit level, as well as understanding the mathematical char-
acteristics of the arithmetic operations, is important for writing programs that
operate correctly over the full range of numeric values.

When casting between signed and unsigned integers of the same size, most
C implementations follow the convention that the underlying bit pattern does
not change. On a two's-complement machine, this behavior is characterized by
functions $T2U_w$ and $U2T_w$, for a w-bit value. The implicit casting of C gives results
that many programmers do not anticipate, often leading to program bugs.

Due to the finite lengths of the encodings, computer arithmetic has properties
quite different from conventional integer and real arithmetic. The finite length can
cause numbers to overflow, when they exceed the range of the representation.
Floating-point values can also underflow, when they are so close to 0.0 that they
are changed to zero.

The finite integer arithmetic implemented by C, as well as most other pro-
gramming languages, has some peculiar properties compared to true integer arith-
metic. For example, the expression x*x can evaluate to a negative number due
to overflow. Nonetheless, both unsigned and two's-complement arithmetic satisfy
many of the other properties of integer arithmetic, including associativity, com-
mutativity, and distributivity. This allows compilers to do many optimizations. For
example, in replacing the expression 7*x by (x<<3)-x, we make use of the as-
sociative, commutative, and distributive properties, along with the relationship
between shifting and multiplying by powers of 2.

Aside  Ariane 5: The high cost of floating-point overflow

Converting large floating-point numbers to integers is a common source of programming errors. Such
an error had disastrous consequences for the maiden voyage of the Ariane 5 rocket, on June 4, 1996. Just
37 seconds after liftoff, the rocket veered off its flight path, broke up, and exploded. Communication
satellites valued at $500 million were on board the rocket.

A later investigation [73, 33] showed that the computer controlling the inertial navigation system
had sent invalid data to the computer controlling the engine nozzles. Instead of sending flight control
information, it had sent a diagnostic bit pattern indicating that an overflow had occurred during the
conversion of a 64-bit floating-point number to a 16-bit signed integer.

The value that overflowed measured the horizontal velocity of the rocket, which could be more
than five times higher than that achieved by the earlier Ariane 4 rocket. In the design of the Ariane 4
software, they had carefully analyzed the numeric values and determined that the horizontal velocity
would never overflow a 16-bit number. Unfortunately, they simply reused this part of the software in
the Ariane 5 without checking the assumptions on which it had been based.


We have seen several clever ways to exploit combinations of bit-level opera-
tions and arithmetic operations. For example, we saw that with two's-complement
arithmetic, ~x+1 is equivalent to -x. As another example, suppose we want a bit
pattern of the form [0, ..., 0, 1, ..., 1], consisting of w - k zeros followed by k
ones. Such bit patterns are useful for masking operations. This pattern can be gen-
erated by the C expression (1<<k)-1, exploiting the property that the desired
bit pattern has numeric value 2^k - 1. For example, the expression (1<<8)-1 will
generate the bit pattern 0xFF.
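
The integer identities recalled in this summary can be written out as a short C
sketch. The function names are ours. Unsigned arithmetic is used so that each
identity holds modulo $2^w$ even when a result overflows, and low_mask assumes
0 <= k < w, since a shift by w or more is not defined by the C standard.

```c
#include <assert.h>

/* 7*x computed as (x << 3) - x. */
unsigned times7(unsigned x) {
    return (x << 3) - x;
}

/* Two's-complement negation: ~x + 1 equals -x modulo 2^w. */
unsigned negate(unsigned x) {
    return ~x + 1u;
}

/* Mask of w - k zeros followed by k ones, for 0 <= k < w. */
unsigned low_mask(int k) {
    return (1u << k) - 1u;
}
```

For instance, times7(6) yields 42, negate(5) yields the same bit pattern as
(unsigned) -5, and low_mask(8) yields 0xFF.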

Floating-point representations approximate real numbers by encoding num-
bers of the form $x \times 2^y$. IEEE Standard 754 provides for several different preci-
sions, with the most common being single (32 bits) and double (64 bits). IEEE
floating point also has representations for special values representing plus and
minus infinity, as well as not-a-number.

Floating-point arithmetic must be used very carefully, because it has only
limited range and precision, and because it does not obey common mathematical
properties such as associativity.


Bibliographic Notes 


Reference books on C [45, 61] discuss properties of the different data types and 
operations. Of these two, only Steele and Harbison [45] cover the newer features 
found in ISO C99. There do not yet seem to be any books that cover the features 
found in ISO C11. The C standards do not specify details such as precise word sizes 
or numeric encodings. Such details are intentionally omitted to make it possible 
to implement C on a wide range of different machines. Several books have been 
written giving advice to C programmers [59, 74] that warn about problems with 
overflow, implicit casting to unsigned, and some of the other pitfalls we have 
covered in this chapter. These books also provide helpful advice on variable 
naming, coding styles, and code testing. Seacord's book on security issues in C 
and C++ programs [97] combines information about C programs, how they are 
compiled and executed, and how vulnerabilities may arise. Books on Java (we
recommend the one coauthored by James Gosling, the creator of the language [5])
describe the data formats and arithmetic operations supported by Java.

Most books on logic design [58, 116] have a section on encodings and arith-
metic operations. Such books describe different ways of implementing arithmetic
circuits. Overton's book on IEEE floating point [82] provides a detailed descrip-
tion of the format as well as the properties from the perspective of a numerical
applications programmer.


Homework Problems 


2.55 ◆
Compile and run the sample code that uses show_bytes (file show-bytes.c) on
different machines to which you have access. Determine the byte orderings used
by these machines.


2.56 ◆
Try running the code for show_bytes for different sample values.


2.57 ◆
Write procedures show_short, show_long, and show_double that print the byte
representations of C objects of types short, long, and double, respectively. Try
these out on several machines.


2.58 ◆◆
Write a procedure is_little_endian that will return 1 when compiled and run
on a little-endian machine, and will return 0 when compiled and run on a big-
endian machine. This program should run on any machine, regardless of its word
size.


2.59 ◆◆
Write a C expression that will yield a word consisting of the least significant byte of
x and the remaining bytes of y. For operands x = 0x89ABCDEF and y = 0x76543210,
this would give 0x765432EF.


2.60 ◆◆
Suppose we number the bytes in a w-bit word from 0 (least significant) to w/8 - 1
(most significant). Write code for the following C function, which will return an
unsigned value in which byte i of argument x has been replaced by byte b:

    unsigned replace_byte (unsigned x, int i, unsigned char b);

Here are some examples showing how the function should work:

    replace_byte(0x12345678, 2, 0xAB) --> 0x12AB5678
    replace_byte(0x12345678, 0, 0xAB) --> 0x123456AB

Bit-Level Integer Coding Rules

In several of the following problems, we will artificially restrict what programming
constructs you can use to help you gain a better understanding of the bit-level,
logic, and arithmetic operations of C. In answering these problems, your code
must follow these rules:

* Assumptions
    * Integers are represented in two's-complement form.
    * Right shifts of signed data are performed arithmetically.
    * Data type int is w bits long. For some of the problems, you will be given a
      specific value for w, but otherwise your code should work as long as w is a
      multiple of 8. You can use the expression sizeof(int)<<3 to compute w.

* Forbidden
    * Conditionals (if or ?:), loops, switch statements, function calls, and macro
      invocations.
    * Division, modulus, and multiplication.
    * Relative comparison operators (<, >, <=, and >=).

* Allowed operations
    * All bit-level and logic operations.
    * Left and right shifts, but only with shift amounts between 0 and w - 1.
    * Addition and subtraction.
    * Equality (==) and inequality (!=) tests. (Some of the problems do not allow
      these.)
    * Integer constants INT_MIN and INT_MAX.
    * Casting between data types int and unsigned, either explicitly or im-
      plicitly.


Even with these rules, you should try to make your code readable by choosing
descriptive variable names and using comments to describe the logic behind your
solutions. As an example, the following code extracts the most significant byte
from integer argument x:

    /* Get most significant byte from x */
    int get_msb(int x) {
        /* Shift by w-8 */
        int shift_val = (sizeof(int)-1)<<3;
        /* Arithmetic shift */
        int xright = x >> shift_val;
        /* Zero all but LSB */
        return xright & 0xFF;
    }


2.61 ◆◆
Write C expressions that evaluate to 1 when the following conditions are true and
to 0 when they are false. Assume x is of type int.

A. Any bit of x equals 1.
B. Any bit of x equals 0.
C. Any bit in the least significant byte of x equals 1.
D. Any bit in the most significant byte of x equals 0.

Your code should follow the bit-level integer coding rules (page 128), with the
additional restriction that you may not use equality (==) or inequality (!=) tests.





2.62 ◆◆◆
Write a function int_shifts_are_arithmetic() that yields 1 when run on a
machine that uses arithmetic right shifts for data type int and yields 0 otherwise.
Your code should work on a machine with any word size. Test your code on several
machines.

2.63 ◆◆◆
Fill in code for the following C functions. Function srl performs a logical right
shift using an arithmetic right shift (given by value xsra), followed by other oper-
ations not including right shifts or division. Function sra performs an arithmetic
right shift using a logical right shift (given by value xsrl), followed by other
operations not including right shifts or division. You may use the computation
8*sizeof(int) to determine w, the number of bits in data type int. The shift
amount k can range from 0 to w - 1.

    unsigned srl(unsigned x, int k) {
        /* Perform shift arithmetically */
        unsigned xsra = (int) x >> k;
        .
        .
        .
    }

    int sra(int x, int k) {
        /* Perform shift logically */
        int xsrl = (unsigned) x >> k;
        .
        .
        .
    }

2.64 ◆
Write code to implement the following function:

    /* Return 1 when any odd bit of x equals 1; 0 otherwise.
       Assume w=32 */
    int any_odd_one(unsigned x);

Your function should follow the bit-level integer coding rules (page 128),
except that you may assume that data type int has w = 32 bits.


2.65 ◆◆◆◆
Write code to implement the following function:

    /* Return 1 when x contains an odd number of 1s; 0 otherwise.
       Assume w=32 */
    int odd_ones(unsigned x);

Your function should follow the bit-level integer coding rules (page 128),
except that you may assume that data type int has w = 32 bits.

Your code should contain a total of at most 12 arithmetic, bitwise, and logical
operations.


2.66 ◆◆◆
Write code to implement the following function:

    /*
     * Generate mask indicating leftmost 1 in x. Assume w=32.
     * For example, 0xFF00 -> 0x8000, and 0x6600 --> 0x4000.
     * If x = 0, then return 0.
     */
    int leftmost_one(unsigned x);

Your function should follow the bit-level integer coding rules (page 128),
except that you may assume that data type int has w = 32 bits.
Your code should contain a total of at most 15 arithmetic, bitwise, and logical
operations.
Hint: First transform x into a bit vector of the form [0...011...1].


2.67 99 

You are given the task of writing a procedure int size. is, 32() that yields 1 
when run on a machine for which an int is 32.bits, ahd yields 0 otherwise. You are 
not allowed to use the sizeof operator. Here is a first attempt: 


/* The following code does not run properly on some machines */
int bad_int_size_is_32() {
    /* Set most significant bit (msb) of 32-bit machine */
    int set_msb = 1 << 31;
    /* Shift past msb of 32-bit word */
    int beyond_msb = 1 << 32;

    /* set_msb is nonzero when word size >= 32
       beyond_msb is zero when word size <= 32 */
    return set_msb && !beyond_msb;
}


When compiled and run on a 32-bit SUN SPARC, however, this procedure
returns 0. The following compiler message gives us an indication of the problem:


warning: left shift count >= width of type


A. In what way does our code fail to comply with the C standard?


B. Modify the code to run properly on any machine for which data type int is
at least 32 bits.

C. Modify the code to run properly on any machine for which data type int is 
at least 16 bits. 


2.68 ◆◆
Write code for a function with the following prototype: 


/*
 * Mask with least significant n bits set to 1
 * Examples: n = 6 --> 0x3F, n = 17 --> 0x1FFFF
 * Assume 1 <= n <= w
 */
int lower_one_mask(int n);


Your function should follow the bit-level integer coding rules (page 128). Be
careful of the case n = w. 


2.69 ◆◆
Write code for a function with the following prototype: 


/*
 * Do rotating left shift. Assume 0 <= n < w
 * Examples when x = 0x12345678 and w = 32:
 *    n=4 -> 0x23456781, n=20 -> 0x67812345
 */
unsigned rotate_left(unsigned x, int n);

Your function should follow the bit-level integer coding rules (page 128). Be 
careful of the case n = 0. 


2.70 ◆◆
Write code for the function with the following prototype: 


/*
 * Return 1 when x can be represented as an n-bit, 2's-complement
 * number; 0 otherwise
 * Assume 1 <= n <= w
 */
int fits_bits(int x, int n);


Your function should follow the bit-level integer coding rules (page 128). 


2.71 ◆

You just started working for a company that is implementing a set of procedures
to operate on a data structure where 4 signed bytes are packed into a 32-bit
unsigned. Bytes within the word are numbered from 0 (least significant) to 3
(most significant). You have been assigned the task of implementing a function
for a machine using two's-complement arithmetic and arithmetic right shifts with
the following prototype:


/* Declaration of data type where 4 bytes are packed 
into an unsigned */ 
typedef unsigned packed_t; 


/* Extract byte from word. Return as signed integer */ 
int xbyte(packed_t word, int bytenum); 


That is, the function will extract the designated byte and sign extend it to be 
a 32-bit int. 
Your predecessor (who was fired for incompetence) wrote the following code: 


/* Failed attempt at xbyte */
int xbyte(packed_t word, int bytenum)
{
    return (word >> (bytenum << 3)) & 0xFF;
}


A. What is wrong with this code? 


B. Give a correct implementation of the function that uses only left and right 
shifts, along with one subtraction. 


2.72 ◆◆

You are given the task of writing a function that will copy an integer val into a
buffer buf, but it should do so only if enough space is available in the buffer.
Here is the code you write:


/* Copy integer into buffer if space is available */
/* WARNING: The following code is buggy */
void copy_int(int val, void *buf, int maxbytes) {
    if (maxbytes-sizeof(val) >= 0)
        memcpy(buf, (void *) &val, sizeof(val));
}


This code makes use of the library function memcpy. Although its use is a bit
artificial here, where we simply want to copy an int, it illustrates an approach
commonly used to copy larger data structures.

You carefully test the code and discover that it always copies the value to the
buffer, even when maxbytes is too small.


A. Explain why the conditional test in the code always succeeds. Hint: The
sizeof operator returns a value of type size_t.


B. Show how you can rewrite the conditional test to make it work properly.


2.73 ◆◆
Write code for a function with the following prototype:


/* Addition that saturates to TMin or TMax */ 
int saturating_add(int x, int y);


Instead of overflowing the way normal two's-complement addition does, sat-
urating addition returns TMax when there would be positive overflow, and TMin
when there would be negative overflow. Saturating arithmetic is commonly used 
in programs that perform digital signal processing. 

Your function should follow the bit-level integer coding rules (page 128). 


2.74 ◆◆
Write a function with the following prototype:


/* Determine whether arguments can be subtracted without overflow */
int tsub_ok(int x, int y);


This function should return 1 if the computation x-y does not overflow. 


2.75 ◆◆◆

Suppose we want to compute the complete 2w-bit representation of $x \cdot y$, where
both x and y are unsigned, on a machine for which data type unsigned is w bits.
The low-order w bits of the product can be computed with the expression x*y, so
we only require a procedure with prototype


unsigned unsigned_high_prod(unsigned x, unsigned y);


that computes the high-order w bits of $x \cdot y$ for unsigned variables.
We have access to a library function with prototype


int signed_high_prod(int x, int y);


that computes the high-order w bits of $x \cdot y$ for the case where x and y are in
two's-complement form. Write code calling this procedure to implement the function
for unsigned arguments. Justify the correctness of your solution.

Hint: Look at the relationship between the signed product $x \cdot y$ and the
unsigned product $x' \cdot y'$ in the derivation of Equation 2.18.


2.76 ◆
The library function calloc has the following declaration:


void *calloc(size_t nmemb, size_t size);


According to the library documentation, “The calloc function allocates memory
for an array of nmemb elements of size bytes each. The memory is set to zero. If
nmemb or size is zero, then calloc returns NULL.”

Write an implementation of calloc that performs the allocation by a call to
malloc and sets the memory to zero via memset. Your code should not have any
vulnerabilities due to arithmetic overflow, and it should work correctly regardless
of the number of bits used to represent data of type size_t.

As a reference, functions malloc and memset have the following declarations:


void *malloc(size_t size);
void *memset(void *s, int c, size_t n);


2.77 ◆◆

Suppose we are given the task of generating code to multiply integer variable x
by various different constant factors K. To be efficient, we want to use only the
operations +, -, and <<. For the following values of K, write C expressions to
perform the multiplication using at most three operations per expression.


A. K = 17
B. K = -7
C. K = 60
D. K = -112


2.78 ◆◆


Write code for a function with the following prototype: 


/* Divide by power of 2. Assume 0 <= k < w-1 */
int divide_power2(int x, int k);


The function should compute $x/2^k$ with correct rounding, and it should follow
the bit-level integer coding rules (page 128).


2.79 ◆◆

Write code for a function mul3div4 that, for integer argument x, computes 3 *
x/4 but follows the bit-level integer coding rules (page 128). Your code should
replicate the fact that the computation 3*x can cause overflow.


2.80 ◆◆◆

Write code for a function threefourths that, for integer argument x, computes
the value of $\frac{3}{4}x$, rounded toward zero. It should not overflow. Your function should
follow the bit-level integer coding rules (page 128).


2.81 ◆

Write C expressions to generate the bit patterns that follow, where $a^k$ represents
k repetitions of symbol a. Assume a w-bit data type. Your code may contain
references to parameters j and k, representing the values of j and k, but not a
parameter representing w.


A. $1^{w-k}0^k$
B. $0^{w-k-j}1^k0^j$


2.82 ◆

We are running programs where values of type int are 32 bits. They are repre-
sented in two's complement, and they are right shifted arithmetically. Values of
type unsigned are also 32 bits.

We generate arbitrary values x and y, and convert them to unsigned values as
follows:


/* Create some arbitrary values */
int x = random();
int y = random();
/* Convert to unsigned */
unsigned ux = (unsigned) x;
unsigned uy = (unsigned) y;


For each of the following C expressions, you are to indicate whether or
not the expression always yields 1. If it always yields 1, describe the underlying
mathematical principles. Otherwise, give an example of arguments that make it
yield 0.

A. (x<y) == (-x>-y)
B. ((x+y)<<4) + y-x == 17*y+15*x
C. ~x+~y+1 == ~(x+y)
D. (ux-uy) == -(unsigned)(y-x)
E. ((x >> 2) << 2) <= x


2.83 ◆◆

Consider numbers having a binary representation consisting of an infinite string
of the form 0.y y y y y y ..., where y is a k-bit sequence. For example, the binary
representation of 1/3 is 0.01010101... (y = 01), while the representation of 1/5 is
0.001100110011... (y = 0011).

A. Let $Y = B2U_k(y)$, that is, the number having binary representation y. Give
   a formula in terms of Y and k for the value represented by the infinite string.
   Hint: Consider the effect of shifting the binary point k positions to the right.

B. What is the numeric value of the string for the following values of y?
   (a) 101
   (b) 0110
   (c) 010011
















2.84 ◆

Fill in the return value for the following procedure, which tests whether its first
argument is less than or equal to its second. Assume the function f2u returns an
unsigned 32-bit number having the same bit representation as its floating-point
argument. You can assume that neither argument is NaN. The two flavors of zero,
+0 and -0, are considered equal.


int float_le(float x, float y) {
    unsigned ux = f2u(x);
    unsigned uy = f2u(y);

    /* Get the sign bits */
    unsigned sx = ux >> 31;
    unsigned sy = uy >> 31;

    /* Give an expression using only ux, uy, sx, and sy */
    return            ;
}


2.85 ◆

Given a floating-point format with a k-bit exponent and an n-bit fraction, write 
formulas for the exponent E, the significand M, the fraction f, and the value V 
for the quantities that follow. In addition, describe the bit representation. 


A. The number 7.0 

B. The largest odd integer that can be represented exactly

C. The reciprocal of the smallest positive normalized value


2.86 ◆
Intel-compatible processors also support an “extended-precision” floating-point 
format with an 80-bit word divided into a sign bit, k = 15 exponent bits, a single 
integer bit, and n = 63 fraction bits. The integer bit is an explicit copy of the 
implied bit in the IEEE floating-point representation. That is, it equals 1 for 
normalized values and 0 for denormalized values. Fill in the following table giving 
the approximate values of some “interesting” numbers in this format: 

                                    Extended precision
Description                         Value        Decimal

Smallest positive denormalized      ________     ________
Smallest positive normalized        ________     ________
Largest normalized                  ________     ________


This format can be used in C programs compiled for Intel-compatible ma- 
chines by declaring the data to be of type long double. However, it forces the 
compiler to generate code based on the legacy 8087 floating-point instructions. 
The resulting program will most likely run much slower than would be the case
for data type float or double. 


2.87 ◆
The 2008 version of the IEEE floating-point standard, named IEEE 754-2008,
includes a 16-bit “half-precision” floating-point format. It was originally devised
by computer graphics companies for storing data in which a higher dynamic range
is required than can be achieved with 16-bit integers. This format has 1 sign
bit, 5 exponent bits (k = 5), and 10 fraction bits (n = 10). The exponent bias is
$2^{5-1} - 1 = 15$.

Fill in the table that follows for each of the numbers given, with the following 
instructions for each column: 











Hex: The four hexadecimal digits describing the encoded form.

M: The value of the significand. This should be a number of the form x or x/y,
   where x is an integer and y is an integral power of 2. Examples include 0,

E: The integer value of the exponent.

V: The numeric value represented. Use the notation x or $x \times 2^z$, where x and
   z are integers.

D: The (possibly approximate) numerical value, as is printed using the %f
   formatting specification of printf.

As an example, to represent the number 7/8, we would have s = 0, M = 7/4,
and E = -1. Our number would therefore have an exponent field of $01110_2$
(decimal value 15 - 1 = 14) and a significand field of $1100000000_2$, giving a hex
representation 3B00. The numerical value is 0.875.

You need not fill in entries marked —.




Description                              Hex     M       E       V       D

-0                                       ____    ____    ____    -0      -0.0
Smallest value > 2                       ____    ____    ____    ____    ____
512                                      ____    ____    ____    512     512.0
Largest denormalized                     ____    ____    ____    ____    ____
-∞                                       ____    —       —       -∞      -∞
Number with hex representation 3BB0      3BB0    ____    ____    ____    ____


2.88 ◆◆
Consider the following two 9-bit floating-point representations based on the IEEE
floating-point format.

1. Format A
   ■ There is 1 sign bit.
   ■ There are k = 5 exponent bits. The exponent bias is 15.
   ■ There are n = 3 fraction bits.

2. Format B
   ■ There is 1 sign bit.
   ■ There are k = 4 exponent bits. The exponent bias is 7.
   ■ There are n = 4 fraction bits.

In the following table, you are given some bit patterns in format A, and your
task is to convert them to the closest value in format B. If rounding is necessary
you should round toward +∞. In addition, give the values of numbers given by
the format A and format B bit patterns. Give these as whole numbers (e.g., 17) or
as fractions (e.g., 17/64 or $17/2^6$).

        Format A                    Format B
Bits            Value       Bits            Value

1 01111 001     -9/8        1 0111 0010     -9/8
0 10110 011     ____        ____            ____
1 00111 010     ____        ____            ____
0 00000 111     ____        ____            ____
1 11100 000     ____        ____            ____
0 10111 100     ____        ____            ____











2.89 ◆
We are running programs on a machine where values of type int have a 32-
bit two's-complement representation. Values of type float use the 32-bit IEEE
format, and values of type double use the 64-bit IEEE format.

We generate arbitrary integer values x, y, and z, and convert them to values 
of type double as follows: 


/* Create some arbitrary values */
int x = random();
int y = random();
int z = random();
/* Convert to double */
double dx = (double) x;
double dy = (double) y;
double dz = (double) z;


For each of the following C expressions, you are to indicate whether or 
not the expression always yields 1. If it always yields 1, describe the underlying 
mathematical principles. Otherwise, give an example of arguments that make 
it yield 0. Note that you cannot use an IA32 machine running gcc to test your
answers, since it would use the 80-bit extended-precision representation for both
float and double.


A. (float) x == (float) dx
B. dx - dy == (double) (x-y)
C. (dx + dy) + dz == dx + (dy + dz)
D. (dx * dy) * dz == dx * (dy * dz)
E. dx / dx == dz / dz


2.90 ◆

You have been assigned the task of writing a C function to compute a floating-point
representation of $2^x$. You decide that the best way to do this is to directly
construct the IEEE single-precision representation of the result. When x is too
small, your routine will return 0.0. When x is too large, it will return +∞. Fill in the
blank portions of the code that follows to compute the correct result. Assume the
function u2f returns a floating-point value having an identical bit representation
as its unsigned argument.











float fpwr2(int x)
{
    /* Result exponent and fraction */
    unsigned exp, frac;
    unsigned u;

    if (x < ______) {
        /* Too small. Return 0.0 */
        exp = ______;
        frac = ______;
    } else if (x < ______) {
        /* Denormalized result */
        exp = ______;
        frac = ______;
    } else if (x < ______) {
        /* Normalized result. */
        exp = ______;
        frac = ______;
    } else {
        /* Too big. Return +oo */
        exp = ______;
        frac = ______;
    }

    /* Pack exp and frac into 32 bits */
    u = exp << 23 | frac;
    /* Return as float */
    return u2f(u);
}

2.91 ◆

Around 250 B.C., the Greek mathematician Archimedes proved that 223/71 < π < 22/7.
Had he had access to a computer and the standard library <math.h>, he would have
been able to determine that the single-precision floating-point approximation of
π has the hexadecimal representation 0x40490FDB. Of course, all of these are just
approximations, since π is not rational.

A. What is the fractional binary number denoted by this floating-point value?

B. What is the fractional binary representation of 22/7? Hint: See Problem 2.83.

C. At what bit position (relative to the binary point) do these two approximations
   to π diverge?


Bit-Level Floating-Point Coding Rules

In the following problems, you will write code to implement floating-point functions,
operating directly on bit-level representations of floating-point numbers.
Your code should exactly replicate the conventions for IEEE floating-point operations,
including using round-to-even mode when rounding is required.

To this end, we define data type float_bits to be equivalent to unsigned:

/* Access bit-level representation floating-point number */
typedef unsigned float_bits;

Rather than using data type float in your code, you will use float_bits.
You may use both int and unsigned data types, including unsigned and integer
constants and operations. You may not use any unions, structs, or arrays. Most
significantly, you may not use any floating-point data types, operations, or constants.
Instead, your code should perform the bit manipulations that implement
the specified floating-point operations.

The following function illustrates the use of these coding rules. For argument
f, it returns ±0 if f is denormalized (preserving the sign of f), and returns f
otherwise.

/* If f is denorm, return 0. Otherwise, return f */
float_bits float_denorm_zero(float_bits f) {
    /* Decompose bit representation into parts */
    unsigned sign = f>>31;
    unsigned exp = f>>23 & 0xFF;
    unsigned frac = f & 0x7FFFFF;
    if (exp == 0) {
        /* Denormalized. Set fraction to 0 */
        frac = 0;
    }
    /* Reassemble bits */
    return (sign << 31) | (exp << 23) | frac;
}


2.92 ◆◆
Following the bit-level floating-point coding rules, implement the function with
the following prototype:


/* Compute -f. If f is NaN, then return f. */
float_bits float_negate(float_bits f);


For floating-point number f, this function computes -f. If f is NaN, your
function should simply return f.

Test your function by evaluating it for all $2^{32}$ values of argument f and comparing
the result to what would be obtained using your machine's floating-point
operations.


2.93 ◆◆
Following the bit-level floating-point coding rules, implement the function with
the following prototype:


/* Compute |f|. If f is NaN, then return f. */
float_bits float_absval(float_bits f);


For floating-point number f, this function computes |f|. If f is NaN, your
function should simply return f.

Test your function by evaluating it for all $2^{32}$ values of argument f and comparing
the result to what would be obtained using your machine's floating-point
operations.


2.94 ◆◆◆
Following the bit-level floating-point coding rules, implement the function with
the following prototype:


/* Compute 2*f. If f is NaN, then return f. */
float_bits float_twice(float_bits f);


For floating-point number f, this function computes 2.0 · f. If f is NaN, your
function should simply return f.

Test your function by evaluating it for all $2^{32}$ values of argument f and comparing
the result to what would be obtained using your machine's floating-point
operations.


2.95 ◆◆◆
Following the bit-level floating-point coding rules, implement the function with
the following prototype:


/* Compute 0.5*f. If f is NaN, then return f. */
float_bits float_half(float_bits f);


For floating-point number f, this function computes 0.5 · f. If f is NaN, your
function should simply return f.

Test your function by evaluating it for all $2^{32}$ values of argument f and comparing
the result to what would be obtained using your machine's floating-point
operations.


2.96 ◆◆◆
Following the bit-level floating-point coding rules, implement the function with 
the following prototype: 


/*
 * Compute (int) f.
 * If conversion causes overflow or f is NaN, return 0x80000000
 */
int float_f2i(float_bits f);







For floating-point number f, this function computes (int) f. Your function 
should round toward zero. If f cannot be represented as an integer (e.g., it is out 
of range, or it is NaN), then the function should return 0x80000000. 

Test your function by evaluating it for all $2^{32}$ values of argument f and com-
paring the result to what would be obtained using your machine's floating-point 
operations. 


2.97 ◆◆◆◆
Following the bit-level floating-point coding rules, implement the function with
the following prototype:


/* Compute (float) i */
float_bits float_i2f(int i);


For argument i, this function computes the bit-level representation of
(float) i.

Test your function by evaluating it for all $2^{32}$ values of argument i and comparing
the result to what would be obtained using your machine's floating-point
operations.


Solutions to Practice Problems 


Solution to Problem 2.1 (page 37) 

Understanding the relation between hexadecimal and binary formats will be important
once we start looking at machine-level programs. The method for doing
these conversions is in the text, but it takes a little practice to become familiar.

A. 0x39A7F8 to binary:

   Hexadecimal   3     9     A     7     F     8
   Binary        0011  1001  1010  0111  1111  1000

B. Binary 1100100101111011 to hexadecimal:

   Binary        1100  1001  0111  1011
   Hexadecimal   C     9     7     B

C. 0xD5E4C to binary:

   Hexadecimal   D     5     E     4     C
   Binary        1101  0101  1110  0100  1100

D. Binary 1001101110011110110101 to hexadecimal:

   Binary        10    0110  1110  0111  1011  0101
   Hexadecimal   2     6     E     7     B     5


Solution to Problem 2.2 (page 37) 
This problem gives you a chance to think about powers of 2 and their hexadecimal
representations.

n     $2^n$ (decimal)    $2^n$ (hexadecimal)
9     512              0x200
19    524,288          0x80000
14    16,384           0x4000
16    65,536           0x10000
17    131,072          0x20000
5     32               0x20
7     128              0x80


Solution to Problem 2.3 (page 38) 

This problem gives you a chance to try out conversions between hexadecimal and
decimal representations for some smaller numbers. For larger ones, it becomes
much more convenient and reliable to use a calculator or conversion program.

Decimal                 Binary       Hexadecimal
0                       0000 0000    0x00
167 = 10·16 + 7         1010 0111    0xA7
62 = 3·16 + 14          0011 1110    0x3E
188 = 11·16 + 12        1011 1100    0xBC
3·16 + 7 = 55           0011 0111    0x37
8·16 + 8 = 136          1000 1000    0x88
15·16 + 3 = 243         1111 0011    0xF3
5·16 + 2 = 82           0101 0010    0x52
10·16 + 12 = 172        1010 1100    0xAC
14·16 + 7 = 231         1110 0111    0xE7


Solution to Problem 2.4 (page 39)

When you begin debugging machine-level programs, you will find many cases
where some simple hexadecimal arithmetic would be useful. You can always
convert numbers to decimal, perform the arithmetic, and convert them back, but
being able to work directly in hexadecimal is more efficient and informative.

A. 0x503c + 0x8 = 0x5044. Adding 8 to hex c gives 4 with a carry of 1.

B. 0x503c - 0x40 = 0x4ffc. Subtracting 4 from 3 in the second digit position
   requires a borrow from the third. Since this digit is 0, we must also borrow
   from the fourth position.

C. 0x503c + 64 = 0x507c. Decimal 64 ($2^6$) equals hexadecimal 0x40.

D. 0x50ea - 0x503c = 0xae. To subtract hex c (decimal 12) from hex a (decimal
   10), we borrow 16 from the second digit, giving hex e (decimal 14). In
   the second digit, we now subtract 3 from hex d (decimal 13), giving hex a
   (decimal 10).

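The four results above can be checked mechanically. The following is a minimal sketch; the helper names part_a through part_d and print_hex_results are hypothetical and do not appear in the text.

```c
#include <stdio.h>

/* Hypothetical helpers (not from the text): each one evaluates one of the
   four hexadecimal computations worked out by hand in Problem 2.4. */
unsigned part_a(void) { return 0x503c + 0x8; }
unsigned part_b(void) { return 0x503c - 0x40; }
unsigned part_c(void) { return 0x503c + 64; }      /* decimal 64 is 0x40 */
unsigned part_d(void) { return 0x50ea - 0x503c; }

/* Print each result in hexadecimal for comparison with the hand-derived
   answers. */
void print_hex_results(void) {
    printf("A. 0x%x\n", part_a());
    printf("B. 0x%x\n", part_b());
    printf("C. 0x%x\n", part_c());
    printf("D. 0x%x\n", part_d());
}
```

Calling print_hex_results() prints one line per part, so a borrow or carry mistake in the hand calculation shows up as a mismatch.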


Solution to Problem 2.5 (page 48)
This problem tests your understanding of the byte representation of data and the
two different byte orderings.

A. Little endian: 21          Big endian: 87
B. Little endian: 21 43       Big endian: 87 65
C. Little endian: 21 43 65    Big endian: 87 65 43

Recall that show_bytes enumerates a series of bytes starting from the one with
lowest address and working toward the one with highest address. On a little-
endian machine, it will list the bytes from least significant to most. On a big-endian
machine, it will list bytes from the most significant byte to the least.
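The ordering rule can be observed directly by copying the lowest-addressed bytes of a word into a buffer. This is a sketch under assumptions: the helpers copy_low_address_bytes and is_little_endian are hypothetical names, unsigned is assumed to be 32 bits, and the value 0x87654321 is inferred from the byte sequences listed in the answers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the text): copy the first n bytes of val,
   in increasing address order, into out. This is the same order in which
   show_bytes would list them. */
void copy_low_address_bytes(unsigned val, size_t n, unsigned char *out) {
    memcpy(out, &val, n);
}

/* Return 1 on a little-endian machine and 0 on a big-endian one, by
   checking which byte of the value 1 sits at the lowest address. */
int is_little_endian(void) {
    unsigned one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);
    return first == 1;
}
```

With val = 0x87654321 and n = 3, the buffer would hold 21 43 65 on a little-endian machine and 87 65 43 on a big-endian one, matching part C.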


Solution to Problem 2.6 (page 49)

This problem is another chance to practice hexadecimal to binary conversion. It
also gets you thinking about integer and floating-point representations. We will
explore these representations in more detail later in this chapter.

A. Using the notation of the example in the text, we write the two strings as
   follows:

      0   0   3   5   9   1   4   1
   00000000001101011001000101000001
              *********************
        4   A   5   6   4   5   0   4
     01001010010101100100010100000100

B. With the second word shifted two positions to the right relative to the first,
   we find a sequence with 21 matching bits.

C. We find all bits of the integer embedded in the floating-point number, except
   for the most significant bit having value 1. Such is the case for the example
   in the text as well. In addition, the floating-point number has some nonzero
   high-order bits that do not match those of the integer.


Solution to Problem 2.7 (page 49)

It prints 61 62 63 64 65 66. Recall also that the library routine strlen does not
count the terminating null character, and so show_bytes printed only through the
character 'f'.
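The effect of strlen can be reproduced with a small formatter. This is a sketch, assuming an ASCII character set and the string "abcdef" (inferred from the printed bytes); hex_bytes_of_string is a hypothetical helper, not the book's show_bytes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the text): format the first strlen(s)
   bytes of s as two-digit hex values separated by single spaces. The
   caller must supply out with room for at least 3*strlen(s)+1 bytes.
   Returns the number of bytes formatted. */
size_t hex_bytes_of_string(const char *s, char *out) {
    size_t n = strlen(s);   /* does not count the terminating null */
    size_t i;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        unsigned byte = (unsigned char) s[i];
        if (i + 1 < n)
            sprintf(out + 3 * i, "%.2x ", byte);
        else
            sprintf(out + 3 * i, "%.2x", byte);
    }
    return n;
}
```

Because the loop bound is strlen(s), the null byte 00 that follows 'f' in memory never appears in the output.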


Solution to Problem 2.8 (page 51)
This problem is a drill to help you become more familiar with Boolean operations.

Operation    Result
a            [01101001]
b            [01010101]
~a           [10010110]
~b           [10101010]
a & b        [01000001]
a | b        [01111101]
a ^ b        [00111100]


Solution to Problem 2.9 (page 53)

This problem illustrates how Boolean algebra can be used to describe and reason
about real-world systems. We can see that this color algebra is identical to the
Boolean algebra over bit vectors of length 3.

A. Colors are complemented by complementing the values of R, G, and B.
   From this, we can see that white is the complement of black, yellow is the
   complement of blue, magenta is the complement of green, and cyan is the
   complement of red.

B. We perform Boolean operations based on a bit-vector representation of the
   colors. From this we get the following:

   Blue (001)    |  Green (010)    =  Cyan (011)
   Yellow (110)  &  Cyan (011)     =  Green (010)
   Red (100)     ^  Magenta (101)  =  Blue (001)


Solution to Problem 2.10 (page 54)
This procedure relies on the fact that EXCLUSIVE-OR is commutative and associative,
and that a ^ a = 0 for any a.

Step         *x                                 *y
Initially    a                                  b
Step 1       a                                  a ^ b
Step 2       a ^ (a ^ b) = (a ^ a) ^ b = b      a ^ b
Step 3       b                                  b ^ (a ^ b) = (b ^ b) ^ a = a

See Problem 2.11 for a case where this function will fail.
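The three steps in the table correspond to three assignments. The sketch below is written to match the table; it assumes, as the table does, that x and y point to different locations.

```c
/* Swap the values stored at x and y using three EXCLUSIVE-OR steps.
   Assumes x and y point to different locations; if they alias, the
   value is set to 0 (see Problem 2.11). */
void inplace_swap(int *x, int *y) {
    *y = *x ^ *y;  /* Step 1: *x = a, *y = a ^ b */
    *x = *x ^ *y;  /* Step 2: *x = b, *y = a ^ b */
    *y = *x ^ *y;  /* Step 3: *x = b, *y = a     */
}
```

For instance, starting from a = 5 and b = 9 in two separate variables, the call inplace_swap(&a, &b) leaves a holding 9 and b holding 5.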


Solution to Problem 2.11 (page 55)
This problem illustrates a subtle and interesting feature of our inplace swap
routine.

A. Both first and last have value k, so we are attempting to swap the middle
   element with itself.
B. In this case, arguments x and y to inplace_swap both point to the same
   location. When we compute *x ^ *y, we get 0. We then store 0 as the middle
   element of the array, and the subsequent steps keep setting this element to
   0. We can see that our reasoning in Problem 2.10 implicitly assumed that x
   and y denote different locations.
C. Simply replace the test in line 4 of reverse_array to be first < last, since
   there is no need to swap the middle element with itself.
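With the corrected test, the reversal routine might look as follows. This is a sketch: the loop structure is an assumption modeled on the description above (the problem's own listing is not reproduced in this section), and inplace_swap is repeated so the block stands alone.

```c
/* XOR-based swap from Problem 2.10; requires x != y. */
static void inplace_swap(int *x, int *y) {
    *y = *x ^ *y;
    *x = *x ^ *y;
    *y = *x ^ *y;
}

/* Reverse a[0..cnt-1] in place. The test first < last (rather than
   first <= last) stops before the middle element of an odd-length
   array, so inplace_swap is never called with aliased pointers. */
void reverse_array(int a[], int cnt) {
    int first, last;
    for (first = 0, last = cnt - 1; first < last; first++, last--)
        inplace_swap(&a[first], &a[last]);
}
```

For an odd-length array such as {1, 2, 3, 4, 5}, the middle element 3 is left untouched instead of being overwritten with 0.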


Solution to Problem 2.12 (page 55) 
Here are the expressions: 


A. x & 0xFF
B. x ^ ~0xFF
C. x | 0xFF


These expressions are typical of the kind commonly found in performing low-level
bit operations. The expression ~0xFF creates a mask where the 8 least-significant
bits equal 0 and the rest equal 1. Observe that such a mask will be generated
regardless of the word size. By contrast, the expression 0xFFFFFF00 would only
work when data type int is 32 bits.
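The three expressions can be wrapped in functions to see their effect on a sample word. This is a sketch; the function names are hypothetical, and the expected values in the note below assume a 32-bit unsigned.

```c
/* Hypothetical wrappers (not from the text) around the three
   expressions of Problem 2.12. */

/* A. Keep the least significant byte; clear all other bits. */
unsigned keep_low_byte(unsigned x) { return x & 0xFF; }

/* B. Complement every bit except those in the least significant byte.
      ~0xFF has all bits set except the low 8, for any word size. */
unsigned complement_upper_bytes(unsigned x) { return x ^ ~0xFF; }

/* C. Set the least significant byte to all ones; keep the rest. */
unsigned set_low_byte(unsigned x) { return x | 0xFF; }
```

With x = 0x87654321 on a 32-bit machine, the three functions return 0x00000021, 0x789ABC21, and 0x876543FF, respectively.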
Solution to Problem 2.13 (page 56)
These problems help you think about the relation between Boolean operations
and typical ways that programmers apply masking operations. Here is the code:


/* Declarations of functions implementing operations bis and bic */
int bis(int x, int m);
int bic(int x, int m);

/* Compute x|y using only calls to functions bis and bic */
int bool_or(int x, int y) {
    int result = bis(x,y);
    return result;
}

/* Compute x^y using only calls to functions bis and bic */
int bool_xor(int x, int y) {
    int result = bis(bic(x,y), bic(y,x));
    return result;
}


The bis operation is equivalent to Boolean OR: a bit is set in z if either this
bit is set in x or it is set in m. On the other hand, bic(x, m) is equivalent to x & ~m;
we want the result to equal 1 only when the corresponding bit of x is 1 and of m is
0.

Given that, we can implement | with a single call to bis. To implement ^, we
take advantage of the property

x ^ y = (x & ~y) | (~x & y)

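To try the solution out, bis and bic need bodies. The definitions below are assumptions chosen to match the description above (bis as x | m, bic as x & ~m); the text itself only declares them.

```c
/* Assumed definitions of bis and bic, matching the equivalences stated
   in the solution; the text only declares these functions. */
int bis(int x, int m) { return x | m; }    /* bit set:   set bits where m is 1   */
int bic(int x, int m) { return x & ~m; }   /* bit clear: clear bits where m is 1 */

/* Compute x|y using only calls to bis and bic. */
int bool_or(int x, int y) {
    return bis(x, y);
}

/* Compute x^y as (x & ~y) | (y & ~x), using only bis and bic. */
int bool_xor(int x, int y) {
    return bis(bic(x, y), bic(y, x));
}
```

Comparing bool_or(x, y) against x | y and bool_xor(x, y) against x ^ y for a handful of values is a quick way to confirm the property.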


Solution to Problem 2.14 (page 57)

This problem highlights the relation between bit-level Boolean operations and
logical operations in C. A common programming error is to use a bit-level operation
when a logical one is intended, or vice versa.

Expression    Value     Expression    Value
x & y         0x20      x && y        0x01
x | y         0x7F      x || y        0x01
~x | ~y       0xDF      !x || !y      0x00
x & !y        0x00      x && ~y       0x01


Solution to Problem 2.15 (page 57)

The expression is !(x ^ y).

That is, x^y will be zero if and only if every bit of x matches the corresponding
bit of y. We then exploit the ability of ! to determine whether a word contains any
nonzero bit.

There is no real reason to use this expression rather than simply writing x ==
y, but it demonstrates some of the nuances of bit-level and logical operations.
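The expression can be packaged as a function and compared against ==. This is a minimal sketch; the name equal_bits is hypothetical.

```c
/* Hypothetical wrapper (not from the text): return 1 when x and y are
   equal and 0 otherwise, using only ^ and !. x ^ y is zero exactly when
   every bit of x matches the corresponding bit of y. */
int equal_bits(int x, int y) {
    return !(x ^ y);
}
```

For any pair of ints, equal_bits(x, y) yields the same result as x == y.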


Solution to Problem 2.16 (page 58)
This problem is a drill to help you understand the different shift operations.

                                                Logical               Arithmetic
       x                   x << 3               x >> 2                x >> 2
Hex     Binary        Binary        Hex     Binary        Hex     Binary        Hex
0xC3    [11000011]    [00011000]    0x18    [00110000]    0x30    [11110000]    0xF0
0x75    [01110101]    [10101000]    0xA8    [00011101]    0x1D    [00011101]    0x1D
0x87    [10000111]    [00111000]    0x38    [00100001]    0x21    [11100001]    0xE1
0x66    [01100110]    [00110000]    0x30    [00011001]    0x19    [00011001]    0x19
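The table rows can be reproduced for 8-bit values. This is a sketch with hypothetical function names; the arithmetic shift is built from a logical shift plus explicit sign fill, so it does not depend on how a particular compiler shifts negative signed values.

```c
#include <stdint.h>

/* Hypothetical helpers (not from the text) operating on 8-bit words. */

/* Left shift by 3, discarding bits shifted past the 8-bit word. */
uint8_t shl3(uint8_t x) {
    return (uint8_t) (x << 3);
}

/* Logical right shift by 2: the two vacated high bits become 0. */
uint8_t shr2_logical(uint8_t x) {
    return (uint8_t) (x >> 2);
}

/* Arithmetic right shift by 2: the two vacated high bits become copies
   of the original most significant bit. */
uint8_t shr2_arith(uint8_t x) {
    uint8_t fill = (x & 0x80) ? 0xC0 : 0x00;
    return (uint8_t) ((x >> 2) | fill);
}
```

The two right shifts differ only for inputs whose most significant bit is 1, such as 0xC3 and 0x87 in the table.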


Solution to Problem 2.17 (page 65) 
In general, working through examples for very small word sizes is a very good way 
to understand computer arithmetic. 

The unsignéd values correspond to those in Figure 22. For the'two's- 
complement values, hex digits 0 through 7 have a most significant bit of 0, yielding 
honnegative values, while hex digits 8 through F have a móst significant bit of 1, 
yielding a DEEatve value. 


Hexadecimal Binary 





P o! OxE [1110] «23 +22 +21 = 14 -2 42242) = -2 
Er C OxO [0000] © 0 

A 0x5, [0101] | 27 +29 =5 2-25 

1 ox8 [1000] 235—8 , BeBe yy, 

1 OxD [101] 2242 +29 =13 D2-242 2-3 


OxF piu] 2424242215. 24242 422-1 


oom E 2 owe 
Solution to Problem 2.18 (page 69)

For a 32-bit word, any value consisting of 8 hexadecimal digits beginning with one
of the digits 8 through f represents a negative number. It is quite common to see
numbers beginning with a string of f's, since the leading bits of a negative number
are all ones. You must look carefully, though. For example, the number 0x8048337
has only 7 digits. Filling this out with a leading zero gives 0x08048337, a positive
number.

4004d0: 48 81 ec e0 02 00 00    sub   $0x2e0,%rsp              A. 736
4004d7: 48 8b 44 24 a8          mov   -0x58(%rsp),%rax         B. -88
4004dc: 48 03 47 28             add   0x28(%rdi),%rax          C. 40
4004e0: 48 89 44 24 d0          mov   %rax,-0x30(%rsp)         D. -48
4004e5: 48 8b 44 24 78          mov   0x78(%rsp),%rax          E. 120
4004ea: 48 89 87 88 00 00 00    mov   %rax,0x88(%rdi)          F. 136
4004f1: 48 8b 84 24 f8 01 00    mov   0x1f8(%rsp),%rax         G. 504
4004f8: 00
4004f9: 48 03 44 24 08          add   0x8(%rsp),%rax
4004fe: 48 89 84 24 c0 00 00    mov   %rax,0xc0(%rsp)          H. 192
400505: 00
400506: 48 8b 44 d4 b8          mov   -0x48(%rsp,%rdx,8),%rax  I. -72


Solution to Problem 2.19 (page 71) 
The functions T2U and U2T are very peculiar from a mathematical perspective.
It is important to understand how they behave.

We solve this problem by reordering the rows in the solution of Problem 2.17
according to the two's-complement value and then listing the unsigned value as
the result of the function application. We show the hexadecimal values to make
this process more concrete.


x (hex)    x     T2U_4(x)

0x8       -8      8
0xD       -3     13
0xE       -2     14
0xF       -1     15
0x0        0      0
0x5        5      5


Solution to Problem 2.20 (page 73) 
This exercise tests your understanding of Equation 2.5.

For the first four entries, the values of x are negative and T2U_4(x) = x + 2^4.
For the remaining two entries, the values of x are nonnegative and T2U_4(x) = x.
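The two cases translate directly into code. The sketch below uses ordinary int values to stand in for 4-bit quantities; the names t2u4 and u2t4 are hypothetical, and the arguments are assumed to lie in the ranges -8 through 7 and 0 through 15, respectively.

```c
/* Convert a 4-bit two's-complement value (-8..7) to the unsigned value
   (0..15) that has the same bit pattern: add 2^4 when x is negative. */
int t2u4(int x) {
    return x < 0 ? x + 16 : x;
}

/* The inverse mapping: subtract 2^4 when the most significant bit
   (weight 8) of the unsigned value is set. */
int u2t4(int u) {
    return u >= 8 ? u - 16 : u;
}
```

Applying u2t4 to the result of t2u4 returns the original value, since both functions leave the underlying bit pattern unchanged.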


Solution to Problem 2.21 (page 76)
This problem reinforces your understanding of the relation between two's-
complement and unsigned representations, as well as the effects of the C promotion
rules. Recall that TMin_32 is -2,147,483,648, and that when cast to unsigned it
becomes 2,147,483,648. In addition, if either operand is unsigned, then the other
operand will be cast to unsigned before comparing.


Expression                          Type        Evaluation

-2147483647-1 == 2147483648U        Unsigned    1
-2147483647-1 < 2147483647          Signed      1
-2147483647-1U < 2147483647         Unsigned    0
-2147483647-1 < -2147483647         Signed      1
-2147483647-1U < -2147483647        Unsigned    1
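The five comparisons can be checked by letting the compiler evaluate them. This sketch assumes a machine where int is 32 bits, as in the text; the function names are hypothetical, and a compiler may warn about the mixed signed and unsigned comparisons, which is exactly the behavior under study.

```c
/* Each function returns the value of one comparison.
   Assumes a 32-bit int. */
int cmp_a(void) { return -2147483647-1 == 2147483648U; }  /* unsigned */
int cmp_b(void) { return -2147483647-1 <  2147483647;  }  /* signed   */
int cmp_c(void) { return -2147483647-1U < 2147483647;  }  /* unsigned */
int cmp_d(void) { return -2147483647-1 <  -2147483647; }  /* signed   */
int cmp_e(void) { return -2147483647-1U < -2147483647; }  /* unsigned */
```

In the third and fifth cases the suffix U makes the subtraction itself unsigned, so the left-hand side is 2,147,483,648 rather than a negative number.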


Solution to Problem 2.22 (page 79)
This exercise provides a concrete demonstration of how sign extension preserves
the numeric value of a two's-complement representation.


A. [1011]     -2^3 + 2^1 + 2^0                = -8 + 2 + 1           = -5
B. [11011]    -2^4 + 2^3 + 2^1 + 2^0          = -16 + 8 + 2 + 1      = -5
C. [111011]   -2^5 + 2^4 + 2^3 + 2^1 + 2^0    = -32 + 16 + 8 + 2 + 1 = -5
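The same check can be run mechanically. In this sketch (hypothetical names, w assumed to be between 1 and 31), b2t interprets the low-order w bits of a word as a two's-complement number, and sign_extend_one widens a w-bit pattern by one bit.

```c
/* Interpret the low-order w bits of 'bits' as a two's-complement
   number: the top bit has weight -2^(w-1), the rest are positive. */
int b2t(unsigned bits, int w) {
    unsigned sign = 1u << (w - 1);
    unsigned low = bits & (sign - 1);   /* the w-1 bits below the sign bit */
    return (bits & sign) ? (int)low - (int)sign : (int)low;
}

/* Sign-extend a w-bit pattern to w+1 bits by copying the sign bit
   into the new high-order position. */
unsigned sign_extend_one(unsigned bits, int w) {
    unsigned sign = 1u << (w - 1);
    return (bits & sign) ? (bits | (sign << 1)) : bits;
}
```

Extending [1011] once gives [11011], and extending again gives [111011]; b2t reports -5 for all three.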

Solution to Problem 2.23 (page 80)

The expressions in these functions are common program "idioms" for extracting
values from a word in which multiple bit fields have been packed. They exploit
the zero-filling and sign-extending properties of the different shift operations.
Note carefully the ordering of the cast and shift operations. In fun1, the shifts
are performed on unsigned variable word and hence are logical. In fun2, shifts
are performed after casting word to int and hence are arithmetic.


A.  w             fun1(w)       fun2(w)

    0x00000076    0x00000076    0x00000076
    0x87654321    0x00000021    0x00000021
    0x000000C9    0x000000C9    0xFFFFFFC9
    0xEDCBA987    0x00000087    0xFFFFFF87


B. Function fun1 extracts a value from the low-order 8 bits of the argument,
   giving an integer ranging between 0 and 255. Function fun2 extracts a value
   from the low-order 8 bits of the argument, but it also performs sign extension.
   The result will be a number between -128 and 127.


Solution to Problem 2.24 (page 82)

The effect of truncation is fairly intuitive for unsigned numbers, but not for two's-
complement numbers. This exercise lets you explore its properties using very small
word sizes.

       Hex                  Unsigned             Two's complement
Original  Truncated    Original  Truncated    Original  Truncated

   0         0            0         0            0         0
   2         2            2         2            2         2
   9         1            9         1           -7         1
   B         3           11         3           -5         3
   F         7           15         7           -1        -1


As Equation 2.9 states, the effect of this truncation on unsigned values is to
simply find their residue, modulo 8. The effect of the truncation on signed values
is a bit more complex. According to Equation 2.10, we first compute the modulo 8
residue of the argument. This will give values 0 through 7 for arguments 0 through
7, and also for arguments -8 through -1. Then we apply function U2T_3 to these
residues, giving two repetitions of the sequences 0 through 3 and -4 through -1.
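The two-step description of signed truncation can be written out as a short sketch. The function names are hypothetical, and ordinary int values stand in for the 4-bit inputs and 3-bit results.

```c
/* Truncate a 4-bit unsigned value to 3 bits: keep the residue modulo 8. */
int trunc_u(int x) {
    return x & 0x7;
}

/* Truncate a 4-bit two's-complement value (-8..7) to 3 bits: take the
   residue modulo 8, then reinterpret that residue as a 3-bit
   two's-complement number (the U2T_3 step). */
int trunc_t(int x) {
    int r = ((x % 8) + 8) % 8;   /* residue in 0..7, also for negative x */
    return r >= 4 ? r - 8 : r;
}
```

Both functions keep the same three low-order bits; they differ only in how the resulting bit pattern is interpreted.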


Solution to Problem 2.25 (page 83) 

This problem is designed to demonstrate how easily bugs can arise due to the 
implicit casting from signed to unsigned. It seems quite natural to pass parameter 
length as an unsigned, since one would never want to use a negative length. The 
stopping criterion i <= length-1 also seems quite natural. But combining these 
two yields an unexpected outcome! 

Since parameter length is unsigned, the computation 0 - 1 is performed using
unsigned arithmetic, which is equivalent to modular addition. The result is then
UMax. The <= comparison is also performed using an unsigned comparison, and
since any number is less than or equal to UMax, the comparison always holds!
Thus, the code attempts to access invalid elements of array a.

The code can be fixed either by declaring length to be an int or by changing
the test of the for loop to be i < length.
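A sketch of the second fix follows. The function body is a reconstruction based on the description above (a loop summing the elements of array a); the exact code in the problem statement may differ in detail.

```c
/* Sum the first 'length' elements of a. The loop test i < length is
   safe when length is 0, whereas i <= length-1 would compare against
   UMax and run past the end of the array. */
float sum_elements(const float a[], unsigned length) {
    float result = 0;
    for (unsigned i = 0; i < length; i++)
        result += a[i];
    return result;
}
```

With this test, a call with length equal to 0 performs no iterations and returns 0.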


Solution to Problem 2.26 (page 83) 

This example demonstrates a subtle feature of unsigned arithmetic, and also the 
property that we sometimes perform unsigned arithmetic without realizing it. This 
can lead to very tricky bugs. 


A. For what cases will this function produce an incorrect result? The function 
will incorrectly return 1 when s is shorter than t. 


B. Explain how this incorrect result comes about. Since strlen is defined to
   yield an unsigned result, the difference and the comparison are both
   computed using unsigned arithmetic. When s is shorter than t, the difference
   strlen(s) - strlen(t) should be negative, but instead becomes a large,
   unsigned number, which is greater than 0.


C. Show how to fix the code so that it will work reliably. Replace the test with
   the following:


   return strlen(s) > strlen(t);
Solution to Problem 2.27 (page 89)
This function is a direct implementation of the rules given to determine whether
or not an unsigned addition overflows.


/* Determine whether arguments can be added without overflow */
int uadd_ok(unsigned x, unsigned y) {
    unsigned sum = x+y;
    return sum >= x;
}


Solution to Problem 2.28 (page 89)
This problem is a simple demonstration of arithmetic modulo 16. The easiest way
to solve it is to convert the hex pattern into its unsigned decimal value. For nonzero
values of x, we must have (-^u_4 x) + x = 16. Then we convert the complemented
value back to hex.

       x                 -^u_4 x
Hex    Decimal      Decimal    Hex

0       0             0        0
5       5            11        B
8       8             8        8
D      13             3        3
F      15             1        1


Solution to Problem 2.29 (page 93)
This problem is an exercise to make sure you understand two's-complement
addition.

x          y          x + y       x +^t_5 y    Case

-12        -15        -27         5            1
[10100]    [10001]    [100101]    [00101]
-8         -8         -16         -16          2
[11000]    [11000]    [110000]    [10000]
-9         8          -1          -1           2
[10111]    [01000]    [111111]    [11111]
2          5          7           7            3
[00010]    [00101]    [000111]    [00111]
12         4          16          -16          4
[01100]    [00100]    [010000]    [10000]
Solution to Problem 2.30 (page 94)
This function is a direct implementation of the rules given to determine whether
or not a two's-complement addition overflows.


/* Determine whether arguments can be added without overflow */
int tadd_ok(int x, int y) {
    int sum = x+y;
    int neg_over = x < 0 && y < 0 && sum >= 0;
    int pos_over = x >= 0 && y >= 0 && sum < 0;
    return !neg_over && !pos_over;
}


Solution to Problem 2.31 (page 94) 

Your coworker could have learned, by studying Section 2.3.2, that two's-
complement addition forms an abelian group, and so the expression (x+y)-x
will evaluate to y regardless of whether or not the addition overflows, and that
(x+y)-y will always evaluate to x.


Solution to Problem 2.32 (page 94) 
This function will give correct values, except when y is TMin. In this case, we
will have -y also equal to TMin, and so the call to function tadd_ok will indicate
overflow when x is negative and no overflow when x is nonnegative. In fact, the
opposite is true: since x - TMin equals x + 2^(w-1), the subtraction stays in range
exactly when x is negative, so tsub_ok(x, TMin) should yield 1 when x is negative
and 0 when it is nonnegative.

One lesson to be learned from this exercise is that TMin should be included
as one of the cases in any test procedure for a function.
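One way to handle the special case is sketched below. It is an illustration rather than the book's code: tadd_ok is rewritten here with range checks against INT_MIN and INT_MAX so that no overflowing addition is ever evaluated, and tsub_ok treats y equal to TMin separately. The reasoning for that case is that x - TMin equals x + 2^(w-1), which fits in w bits exactly when x is negative.

```c
#include <limits.h>

/* Return 1 if x+y can be computed without overflow. Written with
   range checks so that no overflowing addition is performed. */
int tadd_ok(int x, int y) {
    if (y > 0)
        return x <= INT_MAX - y;
    return x >= INT_MIN - y;
}

/* Return 1 if x-y can be computed without overflow. */
int tsub_ok(int x, int y) {
    if (y == INT_MIN)
        return x < 0;        /* x - TMin fits only when x is negative */
    return tadd_ok(x, -y);
}
```

Including TMin among the test arguments, as the lesson above suggests, exercises both branches of tsub_ok.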


Solution to Problem 2.33 (page 95) 
This problem helps you understand two's-complement negation using a very small
word size.

For w = 4, we have TMin_4 = -8. So -8 is its own additive inverse, while other
values are negated by integer negation.


       x                 -^t_4 x
Hex    Decimal      Decimal    Hex

0       0             0        0
5       5            -5        B
8      -8            -8        8
D      -3             3        3
F      -1             1        1


The bit patterns are the same as for unsigned negation. 


Solution to Problem 2.34 (page 98) 
This problem is an exercise to make sure you understand two's-complement
multiplication.

Mode                 x           y           x · y           Truncated x · y

Unsigned             4  [100]    5  [101]    20  [010100]    4  [100]
Two's complement    -4  [100]   -3  [101]    12  [001100]   -4  [100]
Unsigned             2  [010]    7  [111]    14  [001110]    6  [110]
Two's complement     2  [010]   -1  [111]    -2  [111110]   -2  [110]
Unsigned             6  [110]    6  [110]    36  [100100]    4  [100]
Two's complement    -2  [110]   -2  [110]     4  [000100]   -4  [100]


Solution to Problem 2.35 (page 99) 
It is not realistic to test this function for all possible values of x and y. Even if
you could run 10 billion tests per second, it would require over 58 years to test all
combinations when data type int is 32 bits. On the other hand, it is feasible to test
your code by writing the function with data type short or char and then testing
it exhaustively.

Here's a more principled approach, following the proposed set of arguments:


1. We know that x · y can be written as a 2w-bit two's-complement number. Let
   u denote the unsigned number represented by the lower w bits, and v denote
   the two's-complement number represented by the upper w bits. Then, based
   on Equation 2.3, we can see that x · y = v2^w + u.

   We also know that u = T2U_w(p), since they are unsigned and two's-
   complement numbers arising from the same bit pattern, and so by Equation
   2.6, we can write u = p + p_{w-1}2^w, where p_{w-1} is the most significant bit of p.
   Letting t = v + p_{w-1}, we have x · y = p + t2^w.

   When t = 0, we have x · y = p; the multiplication does not overflow. When
   t ≠ 0, we have x · y ≠ p; the multiplication does overflow.

2. By definition of integer division, dividing p by nonzero x gives a quotient
   q and a remainder r such that p = x · q + r, and |r| < |x|. (We use absolute
   values here, because the signs of x and r may differ. For example, dividing -7
   by 2 gives quotient -3 and remainder -1.)

3. Suppose q = y. Then we have x · y = x · y + r + t2^w. From this, we can see
   that r + t2^w = 0. But |r| < |x| ≤ 2^w, and so this identity can hold only if t = 0,
   in which case r = 0.

   Suppose r = t = 0. Then we will have x · y = x · q, implying that y = q.


When x equals 0, multiplication does not overflow, and so we see that our code 
provides a reliable way to test whether or not two's-complement multiplication 
causes overflow. 


Solution to Problem 2.36 (page 99)
With 64 bits, we can perform the multiplication without overflowing. We then test
whether casting the product to 32 bits changes the value:

1   /* Determine whether the arguments can be multiplied
2      without overflow */
3   int tmult_ok(int x, int y) {
4       /* Compute product without overflow */
5       int64_t pll = (int64_t) x*y;
6       /* See if casting to int preserves value */
7       return pll == (int) pll;
8   }


Note that the casting on the right-hand side of line 5 is critical. If we instead
wrote the line as


    int64_t pll = x*y;


the product would be computed as a 32-bit value (possibly overflowing) and then 
sign extended to 64 bits. 


Solution to Problem 2.37 (page 99) 


A. This change does not help at all. Even though the computation of asize will 
be accurate, the call to malloc will cause this value to be converted to a 32-bit 
unsigned number, and so the same overflow conditions will occur. 


B. With malloc having a 32-bit unsigned number as its argument, it cannot
   possibly allocate a block of more than 2^32 bytes, and so there is no point
   attempting to allocate or copy this much memory. Instead, the function
   should abort and return NULL, as illustrated by the following replacement
   to the original call to malloc (line 9):


   uint64_t required_size = ele_cnt * (uint64_t) ele_size;
   size_t request_size = (size_t) required_size;
   if (required_size != request_size)
       /* Overflow must have occurred. Abort operation */
       return NULL;
   void *result = malloc(request_size);
   if (result == NULL)
       /* malloc failed */
       return NULL;


Solution to Problem 2.38 (page 102) 
In Chapter 3, we will see many examples of the lea instruction in action. The
instruction is provided to support pointer arithmetic, but the C compiler often
uses it as a way to perform multiplication by small constants.

For each value of k, we can compute two multiples: 2^k (when b is 0) and 2^k + 1
(when b is a). Thus, we can compute multiples 1, 2, 3, 4, 5, 8, and 9.
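The list of multiples can be confirmed by enumeration. This sketch models the computation (a<<k) + b with k ranging from 0 to 3 and b equal to either 0 or a; the function names and the bitmask encoding of the result are choices made for this illustration.

```c
/* Multiplier applied to a by (a << k) + b, where b is either 0 or a. */
int lea_multiplier(int k, int b_is_a) {
    return (1 << k) + (b_is_a ? 1 : 0);
}

/* Collect every reachable multiplier in a bitmask:
   bit m of the result is set when multiple m can be computed. */
unsigned reachable_multipliers(void) {
    unsigned mask = 0;
    for (int k = 0; k <= 3; k++)
        for (int b = 0; b <= 1; b++)
            mask |= 1u << lea_multiplier(k, b);
    return mask;
}
```

The mask has bits 1, 2, 3, 4, 5, 8, and 9 set; the multiple 2 is reachable in two ways (k = 1 with b = 0, and k = 0 with b = a).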
Solution to Problem 2.39 (page 103)
The expression simply becomes -(x<<m). To see this, let the word size be w so that
n = w - 1. Form B states that we should compute (x<<w) - (x<<m), but shifting
x to the left by w will yield the value 0.


Solution to Problem 2.40 (page 103) 
This problem requires you to try out the optimizations already described and also 
to supply a bit of your own ingenuity. 


K     Shifts    Add/Subs    Expression

6     2         1           (x<<2) + (x<<1)
31    1         1           (x<<5) - x
-6    2         1           (x<<1) - (x<<3)
55    2         2           (x<<6) - (x<<3) - x


Observe that the fourth case uses a modified version of form B. We can view
the bit pattern [110111] as having a run of 6 ones with a zero in the middle, and so
we apply the rule for form B, but then we subtract the term corresponding to the
middle zero bit.
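Each expression in the table can be compared against ordinary multiplication. The sketch below uses unsigned operands so that wraparound is well defined in C; since two's-complement and unsigned multiplication agree at the bit level, the same bit patterns result for signed values. The function names are hypothetical.

```c
/* Shift-and-add/subtract forms for multiplying by 6, 31, -6, and 55. */
unsigned mul6(unsigned x)     { return (x << 2) + (x << 1); }
unsigned mul31(unsigned x)    { return (x << 5) - x; }
unsigned mul_neg6(unsigned x) { return (x << 1) - (x << 3); }
unsigned mul55(unsigned x)    { return (x << 6) - (x << 3) - x; }
```

For the fourth case, 64x - 8x - x = 55x: the run of six ones contributes 64x - x, and the middle zero bit (weight 8) is subtracted.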




Solution to Problem 2.41 (page 103) 
Assuming that addition and subtraction have the same performance, the rule is
to choose form A when n = m, either form when n = m + 1, and form B when
n > m + 1.

The justification for this rule is as follows. Assume first that m > 0. When
n = m, form A requires only a single shift, while form B requires two shifts
and a subtraction. When n = m + 1, both forms require two shifts and either an
addition or a subtraction. When n > m + 1, form B requires only two shifts and one
subtraction, while form A requires n - m + 1 > 2 shifts and n - m > 1 additions.
For the case of m = 0, we get one fewer shift for both forms A and B, and so the
same rules apply for choosing between the two.



Solution to Problem 2.42 (page 107) " 

The only challenge here is to compute the bias without any testing or conditional
operations. We use the trick that the expression x >> 31 generates a word with all
ones if x is negative, and all zeros otherwise. By masking off the appropriate bits,
we get the desired bias value.


int div16(int x) {
    /* Compute bias to be either 0 (x >= 0) or 15 (x < 0) */
    int bias = (x >> 31) & 0xF;
    return (x + bias) >> 4;
}


Solution to Problem 2.43 (page 107) 


We have found that people have difficulty with this exercise when working
directly with assembly code. It becomes more clear when put in the form shown in
optarith.

We can see that M is 31; x*M is computed as (x<<5) - x.


We can see that N is 8; a bias value of 7 is added when y is negative, and the 
right shift is by 3. 
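With M = 31 and N = 8, the two forms can be compared side by side. The code below is a reconstruction for illustration: arith computes x*M + y/N directly, and optarith follows the shift-based steps described above. The exact code in the problem statement may differ. The sketch assumes that right shifts of negative int values are arithmetic, which holds for common compilers but is implementation-defined in C, and that x is small and nonnegative so the left shift does not overflow.

```c
#define M 31
#define N 8

/* Straightforward form: multiply by M and divide by N. */
int arith(int x, int y) {
    return x * M + y / N;
}

/* Shift-based form: x*31 as (x<<5) - x, and y/8 as a right shift by 3
   with a bias of 7 added first when y is negative. */
int optarith(int x, int y) {
    int t = x;
    x <<= 5;
    x -= t;
    if (y < 0)
        y += 7;
    y >>= 3;
    return x + y;
}
```

The bias makes the shift round toward zero, matching C integer division: for y = -9, (-9 + 7) >> 3 gives -1, the same as -9/8.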


Solution to Problem 2.44 (page 108) 
These "C puzzle" problems provide a clear demonstration that programmers must 
understand the properties of computer arithmetic: 
A. (x > 0) || (x-1 < 0)
   False. Let x be -2,147,483,648 (TMin_32). We will then have x-1 equal to
   2,147,483,647 (TMax_32).

B. (x & 7) != 7 || (x<<29 < 0)
   True. If (x & 7) != 7 evaluates to 0, then we must have bit x_2 equal to 1.
   When shifted left by 29, this will become the sign bit.

C. (x * x) >= 0
   False. When x is 65,535 (0xFFFF), x*x is -131,071 (0xFFFE0001).

D. x < 0 || -x <= 0
   True. If x is nonnegative, then -x is nonpositive.

E. x > 0 || -x >= 0
   False. Let x be -2,147,483,648 (TMin_32). Then both x and -x are negative.

F. x+y == uy+ux
   True. Two's-complement and unsigned addition have the same bit-level
   behavior, and they are commutative.

G. x*~y + uy*ux == -x
   True. ~y equals -y-1. uy*ux equals x*y. Thus, the left-hand side is equivalent
   to x*-y-x+x*y.


Solution to Problem 2.45 (page 111) 
Understanding fractional binary representations is an important step to under- 
standing floating-point encodings. This exercise lets you try out some simple ex- 
amples. 
Fractional value    Binary representation    Decimal representation

1/8                 0.001                    0.125
3/4                 0.11                     0.75
25/16               1.1001                   1.5625
43/16               10.1011                  2.6875
9/8                 1.001                    1.125
47/8                101.111                  5.875
51/16               11.0011                  3.1875

One simple way to think about fractional binary representations is to represent
a number as a fraction of the form x/2^k. We can write this in binary using the
binary representation of x, with the binary point inserted k positions from the
right. As an example, for 25/16, we have 25_10 = 11001_2. We then put the binary point
four positions from the right to get 1.1001_2.
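Both directions of this view can be sketched in a few lines. The function names are hypothetical; frac_value evaluates x/2^k, and parse_binary reads a string of binary digits with an optional binary point (the input is assumed to contain only the characters 0, 1, and at most one point).

```c
/* Value of x / 2^k: the binary representation of x with the binary
   point placed k positions from the right (k assumed below 32). */
double frac_value(unsigned x, int k) {
    return (double)x / (double)(1u << k);
}

/* Evaluate a string such as "10.1011" as a fractional binary number. */
double parse_binary(const char *s) {
    double value = 0.0, weight = 1.0;
    int seen_point = 0;
    for (; *s; s++) {
        if (*s == '.') {
            seen_point = 1;
            continue;
        }
        int bit = *s - '0';
        if (!seen_point) {
            value = value * 2 + bit;   /* integer part: shift and add */
        } else {
            weight /= 2;               /* 1/2, 1/4, 1/8, ... */
            value += bit * weight;
        }
    }
    return value;
}
```

All the values involved are sums of a few powers of 2, so the double results are exact.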


Solution to Problem 2.46 (page 111) i 
In most cases, the limited precision of floating-point numbers is not a major 
problem, because the relative error of the computation is still fairly low.'Ín this 
example, however, the system was sensitive to the absolute error. 


A. We can see that 0.1 - x has the binary representation

   0.000000000000000000000001100[1100]···_2

B. Comparing this to the binary representation of 1/10, we can see that it is simply
   2^-20 × 1/10, which is around 9.54 × 10^-8.

C. 9.54 × 10^-8 × 100 × 60 × 60 × 10 ≈ 0.343 seconds.

D. 0.343 × 2,000 ≈ 687 meters.

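The arithmetic in parts B through D can be reproduced with double-precision values. This is a sketch under the assumptions used in the solution: x is 0.1 truncated to 23 bits to the right of the binary point, the clock counts tenths of a second, and the target speed is 2,000 meters per second. The function names are hypothetical.

```c
/* 0.1 truncated to 23 bits to the right of the binary point. */
double truncated_tenth(void) {
    const double two23 = 8388608.0;            /* 2^23 */
    return (double)(long)(0.1 * two23) / two23;
}

/* Accumulated clock error, in seconds, after the given running time. */
double clock_error_seconds(double hours) {
    double ticks = hours * 60 * 60 * 10;       /* one tick per 0.1 s */
    return (0.1 - truncated_tenth()) * ticks;
}

/* Position error, in meters, for a target moving at the given speed. */
double tracking_error_meters(double hours, double meters_per_second) {
    return clock_error_seconds(hours) * meters_per_second;
}
```

For 100 hours the clock error comes out near 0.343 seconds, and at 2,000 meters per second the position error is close to 687 meters.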
Solution to Problem 2.47 (page 117) 


Working through floating-point representations for very small word sizes helps
clarify how IEEE floating point works. Note especially the transition between
denormalized and normalized values.

Bits       e    E    2^E    f      M      2^E × M    V      Decimal

0 00 00    0    0    1      0/4    0/4    0/4        0      0.0
0 00 01    0    0    1      1/4    1/4    1/4        1/4    0.25
0 00 10    0    0    1      2/4    2/4    2/4        1/2    0.5
0 00 11    0    0    1      3/4    3/4    3/4        3/4    0.75
0 01 00    1    0    1      0/4    4/4    4/4        1      1.0
0 01 01    1    0    1      1/4    5/4    5/4        5/4    1.25
0 01 10    1    0    1      2/4    6/4    6/4        3/2    1.5
0 01 11    1    0    1      3/4    7/4    7/4        7/4    1.75
0 10 00    2    1    2      0/4    4/4    8/4        2      2.0
0 10 01    2    1    2      1/4    5/4    10/4       5/2    2.5
0 10 10    2    1    2      2/4    6/4    12/4       3      3.0
0 10 11    2    1    2      3/4    7/4    14/4       7/2    3.5
0 11 00    -    -    -      -      -      -          ∞      -
0 11 01    -    -    -      -      -      -          NaN    -
0 11 10    -    -    -      -      -      -          NaN    -
0 11 11    -    -    -      -      -      -          NaN    -
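The table can be regenerated with a small decoder. The sketch below assumes the format implied by the rows above: 1 sign bit, k = 2 exponent bits, n = 2 fraction bits, and a bias of 1. The name decode5 is hypothetical.

```c
#include <math.h>   /* only for the INFINITY and NAN macros */

/* Decode a 5-bit floating-point pattern: sign, 2 exponent bits,
   2 fraction bits, bias 1. */
double decode5(unsigned bits) {
    int sign = (bits >> 4) & 1;
    int e = (bits >> 2) & 3;
    int frac = bits & 3;
    double f = frac / 4.0;
    double v;
    if (e == 0) {                 /* denormalized: E = 1 - bias = 0, M = f */
        v = f;
    } else if (e == 3) {          /* special: infinity if f is 0, else NaN */
        v = (frac == 0) ? INFINITY : NAN;
    } else {                      /* normalized: E = e - bias, M = 1 + f */
        int E = e - 1;
        v = (double)(1 << E) * (1.0 + f);
    }
    return sign ? -v : v;
}
```

The step from the largest denormalized value (0 00 11, giving 0.75) to the smallest normalized value (0 01 00, giving 1.0) has the same spacing of 0.25 as the steps on either side, which is the smooth transition the table illustrates.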
Solution to Problem 2.48 (page 119)

Hexadecimal 0x359141 is equivalent to binary [1101011001000101000001]. Shifting
this right 21 places gives 1.101011001000101000001_2 × 2^21. We form the fraction
field by dropping the leading 1 and adding two zeros, giving

[10101100100010100000100]

The exponent is formed by adding bias 127 to 21, giving 148 (binary [10010100]).
We combine this with a sign field of 0 to give a binary representation

[01001010010101100100010100000100]

We see that the matching bits in the two representations correspond to the low-
order bits of the integer, up to the most significant bit equal to 1 matching the
high-order 21 bits of the fraction:

   0   0   3   5   9   1   4   1
00000000001101011001000101000001
           *********************
     4   A   5   6   4   5   0   4
  01001010010101100100010100000100

Solution to Problem 2.49 (page 120) 
This exercise helps you think about what numbers cannot be represented exactly 
in floating point. 
A. The number has binary representation 1, followed by n zeros, followed by 1,
   giving value 2^(n+1) + 1.
B. When n = 23, the value is 2^24 + 1 = 16,777,217.


Solution to Problem 2.50 (page 121) 


Performing rounding by hand helps reinforce the idea of round-to-even with 
binary numbers. 





     Original            Rounded

10.010_2    2 1/4     10.0    2
10.011_2    2 3/8     10.1    2 1/2
10.110_2    2 3/4     11.0    3
11.001_2    3 1/8     11.0    3

Solution to Problem 2.51 (page 122) 
A. Looking at the nonterminating sequence for 1/10, we see that the 2 bits to the
   right of the rounding position are 1, so a better approximation to 1/10 would be
   obtained by incrementing x to get x' = 0.00011001100110011001101_2, which
   is larger than 0.1.


B. We can see that x' - 0.1 has binary representation

   0.0000000000000000000000000[1100]

   Comparing this to the binary representation of 1/10, we can see that it is
   2^-22 × 1/10, which is around 2.38 × 10^-8.

C. 2.38 × 10^-8 × 100 × 60 × 60 × 10 ≈ 0.086 seconds, a factor of 4 less than the
   error in the Patriot system.


D. 0.086 × 2,000 ≈ 171 meters.


Solution to Problem 2.52 (page 122) l 
This problem tests a lot of concepts about floating-point representations, including
the encoding of normalized and denormalized values, as well as rounding.


    Format A              Format B
Bits        Value     Bits        Value     Comments

011 0000    1         0111 000    1
101 1110    15/2      1001 111    15/2
010 1001    25/32     0110 100    3/4       Round down
110 1111    31/2      1011 000    16        Round up
000 0001    1/64      0001 000    1/64      Denorm → norm

Solution to Problem 2.53 (page 125) 
In general, it is better to use a library macro rather than inventing your own code.
This code seems to work on a variety of machines, however.

We assume that the value 1e400 overflows to infinity.


#define POS_INFINITY 1e400
#define NEG_INFINITY (-POS_INFINITY)
#define NEG_ZERO (-1.0/POS_INFINITY)


Solution to Problem 2.54 (page 125) 
Exercises such as this one help you develop your ability to reason about floating- 
point operations from a programmer's perspective. Make sure you understand 


each of the answers. 


A. x == (int)(double) x
   Yes, since double has greater precision and range than int.


B. x == (int)(float) x
   No. For example, when x is TMax.


C. d == (double)(float) d
   No. For example, when d is 1e40, we will get +∞ on the right.


D. f == (float)(double) f
   Yes, since double has greater precision and range than float.


E. f == -(-f)
   Yes, since a floating-point number is negated by simply inverting its sign bit.


F. 1.0/2 == 1/2.0
   Yes, the numerators and denominators will both be converted to floating-
   point representations before the division is performed.


G. d*d >= 0.0
   Yes, although it may overflow to +∞.


H. (f+d)-f == d
   No. For example, when f is 1.0e20 and d is 1.0, the expression f+d will be
   rounded to 1.0e20, and so the expression on the left-hand side will evaluate
   to 0.0, while the right-hand side will be 1.0.
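Two of the negative answers can be observed directly. The sketch below assumes IEEE single- and double-precision formats; the function names are hypothetical. For case B it is tested with 16,777,217 (2^24 + 1) rather than TMax, because converting a float that has rounded up past TMax back to int is not defined by the C standard, whereas 2^24 + 1 simply loses its low-order bit.

```c
/* Case B: does an int survive a round trip through float? */
int roundtrip_float_ok(int x) {
    return x == (int)(float)x;
}

/* Case H: does (f+d)-f give back d? The sum is computed in double,
   so d is lost whenever it is far smaller than f. */
int add_sub_ok(float f, double d) {
    return (f + d) - f == d;
}
```

With f equal to 1.0e20 and d equal to 1.0, the sum f+d rounds back to the value of f, so the left-hand side is 0.0 and add_sub_ok returns 0.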
Computers execute machine code, sequences of bytes encoding the low-level
operations that manipulate data, manage memory, read and write data on
storage devices, and communicate over networks. A compiler generates machine
code through a series of stages, based on the rules of the programming language,
the instruction set of the target machine, and the conventions followed by the
operating system. The GCC C compiler generates its output in the form of assembly
code, a textual representation of the machine code giving the individual
instructions in the program. GCC then invokes both an assembler and a linker to
generate the executable machine code from the assembly code. In this chapter, we
will take a close look at machine code and its human-readable representation as
assembly code.

When programming in a high-level language such as C, and even more so
in Java, we are shielded from the detailed machine-level implementation of our
program. In contrast, when writing programs in assembly code (as was done in the
early days of computing), a programmer must specify the low-level instructions
the program uses to carry out a computation. Most of the time, it is much more
productive and reliable to work at the higher level of abstraction provided by a
high-level language. The type checking provided by a compiler helps detect many
program errors and makes sure we reference and manipulate data in consistent
ways. With modern optimizing compilers, the generated code is usually at least as
efficient as what a skilled assembly-language programmer would write by hand.
Best of all, a program written in a high-level language can be compiled and
executed on a number of different machines, whereas assembly code is highly
machine specific.

So why should we spend our time learning machine code? Even though
compilers do most of the work in generating assembly code, being able to read and
understand it is an important skill for serious programmers. When invoked with
appropriate command-line parameters, the compiler will generate a file
showing its output in assembly-code form. By reading this code, we can
understand the optimization capabilities of the compiler and analyze the underlying
inefficiencies in the code. As we will experience in Chapter 5, programmers
seeking to maximize the performance of a critical section of code often try different
variations of the source code, each time compiling and examining the generated
assembly code to get a sense of how efficiently the program will run. Furthermore,
there are times when the layer of abstraction provided by a high-level language
hides information about the run-time behavior of a program that we need to
understand. For example, when writing concurrent programs using a thread package, as
covered in Chapter 12, it is important to understand how program data are shared
or kept private by the different threads and precisely how and where shared data
are accessed. Such information is visible at the machine-code level. As another
example, many of the ways programs can be attacked, allowing malware to
infest a system, involve nuances of the way programs store their run-time control
information. Many attacks involve exploiting weaknesses in system programs to
overwrite information and thereby take control of the system. Understanding how
these vulnerabilities arise and how to guard against them requires a knowledge of
the machine-level representation of programs. The need for programmers to learn
machine code has shifted over the years from one of being able to write programs
directly in assembly code to one of being able to read and understand the code
generated by compilers.

In this chapter, we will learn the details of one particular assembly language 
and see how C programs get compiled into this form of machine code. Reading
the assembly code generated by a compiler involves a different set of skills than
writing assembly code by hand. We must understand the transformations typical
compilers make in converting the constructs of C into machine code. Relative to
the computations expressed in the C code, optimizing compilers can rearrange
execution order, eliminate unneeded computations, replace slow operations with 
faster ones, and even change recursive computations into iterative ones. Under- 
standing the relation between source code and the generated assembly can often 
be a challenge—it's much like putting together a puzzle having a slightly differ- 
ent design than the picture on the box. It is a form of reverse engineering—trying 
to understand the process by which a system was created by studying the system 
and working backward. In this case, the system is a machine-generated assembly- 
language program, rather than something designed by a human. This simplifies 
the task of reverse engineering because the generated code follows fairly regu- 
lar patterns and we can run experiments, having the compiler generate code for 
many different programs. In our presentation, we give many examples and pro- 
vide a number of exercises illustrating different aspects of assembly language and 
compilers. This is a subject where mastering the details is a prerequisite to under- 
standing the deeper and more fundamental concepts. Those who say “I understand 
the general principles, I don't want to bother learning the details” are deluding 
themselves. It is critical for you to spend time studying the examples, working
through the exercises, and checking your solutions with those provided. 

Our presentation is based on x86-64, the machine language for most of the 
processors found in today’s laptop and desktop machines, as well as those that 
power very large data centers and supercomputers. This language has evolved 
over a long history, starting with Intel Corporation's first 16-bit processor in 1978, 
through to the expansion to 32 bits, and most recently to 64 bits. Along the way, 
features have been added to make better use of the available semiconductor tech- 
nology, and to satisfy the demands of the marketplace. Much of the development 
has been driven by Intel, but its rival Advanced Micro Devices (AMD) has also 
made important contributions. The result is a rather peculiar design with features 
that make sense only when viewed from a historical perspective. It is also laden 
with features providing backward compatibility that are not used by modern
compilers and operating systems. We will focus on the subset of the features used by
GCC and Linux. This allows us to avoid much of the complexity and many of the
arcane features of x86-64.

Our technical presentation starts with a quick tour to show the relation
between C, assembly code, and machine code. We then proceed to the details of
x86-64, starting with the representation and manipulation of data and the
implementation of control. We see how control constructs in C, such as if, while, and
switch statements, are implemented. We then cover the implementation of
procedures, including how the program maintains a run-time stack to support the
passing of data and control between procedures, as well as storage for local
variables. Next, we consider how data structures such as arrays, structures, and unions
are implemented at the machine level. With this background in machine-level
programming, we can examine the problems of out-of-bounds memory references and
the vulnerability of systems to buffer overflow attacks. We finish this part of the
presentation with some tips on using the GDB debugger for examining the run-time
behavior of a machine-level program. The chapter concludes with a presentation
on machine-program representations of code involving floating-point data and
operations.

IA32, the 32-bit predecessor to x86-64, was introduced by Intel in 1985. It served
as the machine language of choice for several decades. Most x86 microprocessors
sold today, and most operating systems installed on these machines, are designed
to run x86-64. However, they can also execute IA32 programs in a backward
compatibility mode. As a result, many application programs are still based on
IA32. In addition, many existing systems cannot execute x86-64, due to limitations
of their hardware or system software. IA32 continues to be an important machine
language. You will find that having a background in x86-64 will enable you to
learn the IA32 machine language quite readily.

The computer industry has recently made the transition from 32-bit to 64-bit
machines. A 32-bit machine can only make use of around 4 gigabytes (2^32
bytes) of random access memory. With memory prices dropping at dramatic
rates, and our computational demands and data sizes increasing, it has become
both economically feasible and technically desirable to go beyond this limitation.
Current 64-bit machines can use up to 256 terabytes (2^48 bytes) of memory, and
could readily be extended to use up to 16 exabytes (2^64 bytes). Although it is
hard to imagine having a machine with that much memory, keep in mind that 
4 gigabytes seemed like an extreme amount of memory when 32-bit machines 
became commonplace in the 1970s and 1980s. 

Our presentation focuses on the types of machine-level programs generated 
when compiling C and similar programming languages targeting modern operating
systems. As a consequence, we make no attempt to describe many of the
features of x86-64 that arise out of its legacy support for the styles of programs 
written in the early days of microprocessors, when much of the code was writ- 
ten manually and where programmers had to struggle with the limited range of 
addresses allowed by 16-bit machines. 


3.1 A Historical Perspective 


The Intel processor line, colloquially referred to as x86, has followed a long
evolutionary development. It started with one of the first single-chip 16-bit
microprocessors, where many compromises had to be made due to the limited
capabilities of integrated circuit technology at the time. Since then, it has
grown to take advantage of technology improvements as well as to satisfy the
demands for higher performance and for supporting more advanced operating systems.

The list that follows shows some models of Intel processors and some of their 
key features, especially those affecting machine-level programming. We use the 
number of transistors required to implement the processors as an indication of 
how they have evolved in complexity. In this table, “K” denotes 1,000 (10^3), “M”
denotes 1,000,000 (10^6), and “G” denotes 1,000,000,000 (10^9).


8086 (1978, 29 K transistors). One of the first single-chip, 16-bit microproces- 
sors. The 8088, a variant of the 8086 with an 8-bit external bus, formed 
the heart of the original IBM personal computers. IBM contracted with 
then-tiny Microsoft to develop the MS-DOS operating system. The orig- 
inal models came with 32,768 bytes of memory and two floppy drives (no 
hard drive). Architecturally, the machines were limited to a 655,360-byte 
address space—addresses were only 20 bits long (1,048,576 bytes address- 
able), and the operating system reserved 393,216 bytes for its own use. 
In 1980, Intel introduced the 8087 floating-point coprocessor (45 K tran- 
sistors) to operate alongside an 8086 or 8088 processor, executing the 
floating-point instructions. The 8087 established the floating-point model 
for the x86 line, often referred to as “x87.”


80286 (1982, 134 K transistors). Added more (and now obsolete) addressing 
modes. Formed the basis of the IBM PC-AT personal computer, the 
original platform for MS Windows. 


i386 (1985, 275 K transistors). Expanded the architecture to 32 bits. Added the 
flat addressing model used by Linux and recent versions of the Windows 
operating system. This was the first machine in the series that could fully 
support a Unix operating system. 


i486 (1989, 1.2 M transistors). Improved performance and integrated the
floating-point unit onto the processor chip but did not significantly change the
instruction set.


Pentium (1993, 3.1 M transistors). Improved performance but only added mi- 
nor extensions to the instruction set. 


PentiumPro (1995, 5.5 M transistors). Introduced a radically new processor
design, internally known as the P6 microarchitecture. Added a class of 
"conditional move" instructions to the instruction set. 


Pentium/MMX (1997, 4.5 M transistors). Added a new class of instructions to the
Pentium processor for manipulating vectors of integers. Each datum can 
be 1, 2, or 4 bytes long. Each vector totals 64 bits. 


Pentium II (1997, 7 M transistors). Continuation of the P6 microarchitecture. 


Pentium III (1999, 8.2 M transistors). Introduced SSE, a class of instructions for
manipulating vectors of integer or floating-point data. Each datum can be
1, 2, or 4 bytes, packed into vectors of 128 bits. Later versions of this chip
went up to 24 M transistors, due to the incorporation of the level-2 cache
on chip.


Pentium 4 (2000, 42 M transistors). Extended SSE to SSE2, adding new data
types (including double-precision floating point), along with 144 new
instructions for these formats. With these extensions, compilers can use SSE
instructions, rather than x87 instructions, to compile floating-point code.


Pentium 4E (2004, 125 M transistors). Added hyperthreading, a method to run
two programs simultaneously on a single processor, as well as EM64T,
Intel’s implementation of a 64-bit extension to IA32 developed by
Advanced Micro Devices (AMD), which we refer to as x86-64.


Core 2 (2006, 291 M transistors). Returned to a microarchitecture similar to 
P6. First multi-core Intel microprocessor, where multiple processors are 
implemented on a single chip. Did not support hyperthreading. 


Core i7, Nehalem (2008, 781 M transistors). Incorporated both hyperthreading 
and multi-core, with the initial version supporting two executing pro- 
grams on each core and up to four cores on each chip. 


Core i7, Sandy Bridge (2011, 1.17 G transistors). Introduced AVX, an exten- 
sion of the SSE to support data packed into 256-bit vectors. 


Core i7, Haswell (2013, 1.4 G transistors). Extended AVX to AVX2, adding 
more instructions and instruction formats. 


Each successive processor has been designed to be backward compatible— 
able to run code compiled for any earlier version. As we will see, there are many 
strange artifacts in the instruction set due to this evolutionary heritage. Intel has 
had several names for their processor line, including IA32, for “Intel Architecture
32-bit” and most recently Intel64, the 64-bit extension to IA32, which we will refer 
to as x86-64. We will refer to the overall line by the commonly used colloquial 
name “x86,” reflecting the processor naming conventions up through the i486. 

Over the years, several companies have produced processors that are com- 
patible with Intel processors, capable of running the exact same machine-level 
programs. Chief among these is Advanced Micro Devices (AMD). For years, 
AMD lagged just behind Intel in technology, forcing a marketing strategy where 
they produced processors that were less expensive although somewhat lower in
performance. They became more competitive around 2002, being the first to break 
the 1-gigahertz clock-speed barrier for a commercially available microprocessor, 
and introducing x86-64, the widely adopted 64-bit extension to Intel's IA32. Al- 
though we will talk about Intel processors, our presentation holds just as well for 
the compatible processors produced by Intel’s rivals. 

Much of the complexity of x86 is not of concern to those interested in programs
for the Linux operating system as generated by the GCC compiler. The memory
model provided in the original 8086 and its extensions in the 80286 became
obsolete with the i386. The original x87 floating-point instructions became
obsolete with the introduction of SSE2. Although we see vestiges of the historical
evolution of x86 in x86-64 programs, many of the most arcane features of x86 do
not appear.

Aside Moore's Law

If we plot the number of transistors in the different Intel processors versus the
year of introduction, and use a logarithmic scale for the y-axis, we can see that
the growth has been phenomenal. Fitting a line through the data, we see that the
number of transistors increases at an annual rate of approximately 37%, meaning
that the number of transistors doubles about every 26 months. This growth has been
sustained over the multiple-decade history of x86 microprocessors.

In 1965, Gordon Moore, a founder of Intel Corporation, extrapolated from the chip
technology of the day (by which they could fabricate circuits with around 64
transistors on a single chip) to predict that the number of transistors per chip
would double every year for the next 10 years. This prediction became known as
Moore's Law. As it turns out, his prediction was just a little bit optimistic, but
also too short-sighted. Over more than 50 years, the semiconductor industry has
been able to double transistor counts on average every 18 months.

Similar exponential growth rates have occurred for other aspects of computer
technology, including the storage capacities of magnetic disks and semiconductor
memories. These remarkable growth rates have been the major driving forces of the
computer revolution.


3.2 Program Encodings 


Suppose we write a C program as two files p1.c and p2.c. We can then compile
this code using a Unix command line:

linux> gcc -Og -o p p1.c p2.c

The command gcc indicates the GCC C compiler. Since this is the default compiler
on Linux, we could also invoke it as simply cc. The command-line option -Og
(see footnote 1) instructs the compiler to apply a level of optimization that
yields machine code that follows the overall structure of the original C code.
Invoking higher levels of optimization can generate code that is so heavily
transformed that the relationship between the generated machine code and the
original source code is difficult to understand. We will therefore use -Og
optimization as a learning tool and then see what happens as we increase the
level of optimization. In practice, higher levels of optimization (e.g.,
specified with the option -O1 or -O2) are considered a better choice in terms of
the resulting program performance.

The gcc command invokes an entire sequence of programs to turn the source
code into executable code. First, the C preprocessor expands the source code to
include any files specified with #include commands and to expand any macros,
specified with #define declarations. Second, the compiler generates
assembly-code versions of the two source files having names p1.s and p2.s. Next,
the assembler converts the assembly code into binary object-code files p1.o and
p2.o. Object code is one form of machine code: it contains binary representations
of all of the instructions, but the addresses of global values are not yet filled
in. Finally, the linker merges these two object-code files along with code
implementing library functions (e.g., printf) and generates the final executable
code file p (as specified by the command-line directive -o p). Executable code is
the second form of machine code we will consider: it is the exact form of code
that is executed by the processor. The relation between these different forms of
machine code and the linking process is described in more detail in Chapter 7.


3.2.1 Machine-Level Code 


As described in Section 1.9.3, computer systems employ several different forms
of abstraction, hiding details of an implementation through the use of a simpler
abstract model. Two of these are especially important for machine-level
programming. First, the format and behavior of a machine-level program is defined
by the instruction set architecture, or ISA, defining the processor state, the
format of the instructions, and the effect each of these instructions will have
on the state. Most ISAs, including x86-64, describe the behavior of a program as
if each instruction is executed in sequence, with one instruction completing
before the next one begins. The processor hardware is far more elaborate,
executing many instructions concurrently, but it employs safeguards to ensure that
the overall behavior matches the sequential operation dictated by the ISA. Second,
the memory addresses used by a machine-level program are virtual addresses,
providing a memory model that appears to be a very large byte array. The actual
implementation of the memory system involves a combination of multiple hardware
memories and operating system software, as described in Chapter 9.

1. This optimization level was introduced in GCC version 4.8. Earlier versions of
GCC, as well as non-GNU compilers, will not recognize this option. For these,
using optimization level one (specified with the command-line flag -O1) is
probably the best choice for generating code that follows the original program
structure.

The compiler does most of the work in the overall compilation sequence, 
transforming programs expressed in the relatively abstract execution model pro- 
vided by C into the very elementary instructions that the processor executes. The 
assembly-code representation is very close to machine code. Its main feature is 
that it is in a more readable textual format, as compared to the binary format of 
machine code. Being able to understand assembly code and how it relates to the 
original C code is a key step in understanding how computers execute programs. 

The machine code for x86-64 differs greatly from the original C code. Parts of 
the processor state are visible that normally are hidden from the C programmer: 


* The program counter (commonly referred to as the PC, and called %rip in
x86-64) indicates the address in memory of the next instruction to be executed.


* The integer register file contains 16 named locations storing 64-bit values.
These registers can hold addresses (corresponding to C pointers) or integer
data. Some registers are used to keep track of critical parts of the program
state, while others are used to hold temporary data, such as the arguments
and local variables of a procedure, as well as the value to be returned by a
function.


* The condition code registers hold status information about the most recently
executed arithmetic or logical instruction. These are used to implement
conditional changes in the control or data flow, such as is required to implement
if and while statements.


* A set of vector registers can each hold one or more integer or floating-point
values.


Whereas C provides a model in which objects of different data types can be 
declared and allocated in memory, machine code views the memory as simply 
a large byte-addressable array. Aggregate data types in C such as arrays and 
structures are represented in machine code as contiguous collections of bytes. 
Even for scalar data types, assembly code makes no distinctions between signed or
unsigned integers, between different types of pointers, or even between pointers 
and integers. 

Aside The ever-changing forms of generated code

In our presentation, we will show the code generated by a particular version of
GCC with particular settings of the command-line options. If you compile code on
your own machine, chances are you will be using a different compiler or a
different version of GCC and hence will generate different code. The open-source
community supporting GCC keeps changing the code generator, attempting to generate
more efficient code according to changing code guidelines provided by the
microprocessor manufacturers.

Our goal in studying the examples shown in our presentation is to demonstrate how
to examine assembly code and map it back to the constructs found in high-level
programming languages. You will need to adapt these techniques to the style of
code generated by your particular compiler.

The program memory contains the executable machine code for the program,
some information required by the operating system, a run-time stack for managing
procedure calls and returns, and blocks of memory allocated by the user (e.g., by
using the malloc library function). As mentioned earlier, the program memory
is addressed using virtual addresses. At any given time, only limited subranges of
virtual addresses are considered valid. For example, x86-64 virtual addresses are
represented by 64-bit words. In current implementations of these machines, the
upper 16 bits must be set to zero, and so an address can potentially specify a byte
over a range of 2^48, or 256 terabytes. More typical programs will only have access
to a few megabytes, or perhaps several gigabytes. The operating system manages
this virtual address space, translating virtual addresses into the physical
addresses of values in the actual processor memory.

A single machine instruction performs only a very elementary operation. For 
example, it might add two numbers stored in registers, transfer data between
memory and a register, or conditionally branch to a new instruction address. The 
compiler must generate sequences of such instructions to implement program 
constructs such as arithmetic expression evaluation, loops, or procedure calls and 
returns. 


3.2.2 Code Examples 


Suppose we write a C code file mstore.c containing the following function
definition:

long mult2(long, long);

void multstore(long x, long y, long *dest) {
    long t = mult2(x, y);
    *dest = t;
}

To see the assembly code generated by the C compiler, we can use the -S
option on the command line:

linux> gcc -Og -S mstore.c

This will cause GCC to run the compiler, generating an assembly file mstore.s,
and go no further. (Normally it would then invoke the assembler to generate an
object-code file.)

The assembly-code file contains various declarations, including the following
set of lines:

multstore:
    pushq   %rbx
    movq    %rdx, %rbx
    call    mult2
    movq    %rax, (%rbx)
    popq    %rbx
    ret

Each indented line in the code corresponds to a single machine instruction. For
example, the pushq instruction indicates that the contents of register %rbx should
be pushed onto the program stack. All information about local variable names or
data types has been stripped away.

If we use the -c command-line option, GCC will both compile and assemble
the code:

linux> gcc -Og -c mstore.c

This will generate an object-code file mstore.o that is in binary format and hence
cannot be viewed directly. Embedded within the 1,368 bytes of the file mstore.o
is a 14-byte sequence with the hexadecimal representation

53 48 89 d3 e8 00 00 00 00 48 89 03 5b c3

This is the object code corresponding to the assembly instructions listed
previously. A key lesson to learn from this is that the program executed by the
machine is simply a sequence of bytes encoding a series of instructions. The
machine has very little information about the source code from which these
instructions were generated.

To display the binary object code for a program (say, mstore), we use a
disassembler (described below) to determine that the code for the procedure is
14 bytes long. Then we run the GNU debugging tool GDB on file mstore.o and give
it the command

(gdb) x/14xb multstore

telling it to display (abbreviated ‘x’) 14 hex-formatted (also ‘x’) bytes (‘b’)
starting at the address where function multstore is located. You will find that
GDB has many useful features for analyzing machine-level programs, as will be
discussed in Section 3.10.2.

To inspect the contents of machine-code files, a class of programs known as
disassemblers can be invaluable. These programs generate a format similar to
assembly code from the machine code. With Linux systems, the program OBJDUMP
(for “object dump”) can serve this role given the -d command-line flag:

linux> objdump -d mstore.o

The result (where we have added line numbers on the left and annotations in
italicized text) is as follows:

   Disassembly of function multstore in binary file mstore.o
1  0000000000000000 <multstore>:
   Offset  Bytes            Equivalent assembly language
2    0:    53               push   %rbx
3    1:    48 89 d3         mov    %rdx,%rbx
4    4:    e8 00 00 00 00   callq  9 <multstore+0x9>
5    9:    48 89 03         mov    %rax,(%rbx)
6    c:    5b               pop    %rbx
7    d:    c3               retq

On the left we see the 14 hexadecimal byte values, listed in the byte sequence
shown earlier, partitioned into groups of 1 to 5 bytes each. Each of these groups
is a single instruction, with the assembly-language equivalent shown on the right.

Several features about machine code and its disassembled representation are
worth noting:

* x86-64 instructions can range in length from 1 to 15 bytes. The instruction
encoding is designed so that commonly used instructions and those with fewer
operands require a smaller number of bytes than do less common ones or ones
with more operands.

* The instruction format is designed in such a way that from a given starting
position, there is a unique decoding of the bytes into machine instructions.
For example, only the instruction pushq %rbx can start with byte value 53.

* The disassembler determines the assembly code based purely on the byte
sequences in the machine-code file. It does not require access to the source or
assembly-code versions of the program.

* The disassembler uses a slightly different naming convention for the
instructions than does the assembly code generated by GCC. In our example, it has
omitted the suffix ‘q’ from many of the instructions. These suffixes are size
designators and can be omitted in most cases. Conversely, the disassembler
adds the suffix ‘q’ to the call and ret instructions. Again, these suffixes can
safely be omitted.


Generating the actual executable code requires running a linker on the set
of object-code files, one of which must contain a function main. Suppose in file
main.c we had the following function:

#include <stdio.h>

void multstore(long, long, long *);

int main() {
    long d;
    multstore(2, 3, &d);
    printf("2 * 3 --> %ld\n", d);
    return 0;
}

long mult2(long a, long b) {
    long s = a * b;
    return s;
}

Then we could generate an executable program prog as follows:

linux> gcc -Og -o prog main.c mstore.c

The file prog has grown to 8,655 bytes, since it contains not just the machine
code for the procedures we provided but also code used to start and terminate
the program as well as to interact with the operating system.

We can disassemble the file prog:

linux> objdump -d prog

The disassembler will extract various code sequences, including the following:

   Disassembly of function multstore in binary file prog
1  0000000000400540 <multstore>:
2    400540:  53               push   %rbx
3    400541:  48 89 d3         mov    %rdx,%rbx
4    400544:  e8 42 00 00 00   callq  40058b <mult2>
5    400549:  48 89 03         mov    %rax,(%rbx)
6    40054c:  5b               pop    %rbx
7    40054d:  c3               retq
8    40054e:  90               nop
9    40054f:  90               nop

This code is almost identical to that generated by the disassembly of mstore.o.
One important difference is that the addresses listed along the left are
different: the linker has shifted the location of this code to a different range
of addresses. A second difference is that the linker has filled in the address
that the callq instruction should use in calling the function mult2 (line 4 of
the disassembly). One task for the linker is to match function calls with the
locations of the executable code for those functions. A final difference is that
we see two additional lines of code (lines 8-9). These instructions will have no
effect on the program, since they occur after the return instruction (line 7).
They have been inserted to grow the code for the function to 16 bytes, enabling a
better placement of the next block of code in terms of memory system performance.


3.2.3 Notes on Formatting 


The assembly code generated by GCC is difficult for a human to read. On one hand,
it contains information with which we need not be concerned, while on the other
hand, it does not provide any description of the program or how it works. For
example, suppose we give the command

linux> gcc -Og -S mstore.c

to generate the file mstore.s. The full content of the file is as follows:

    .file   "010-mstore.c"
    .text
    .globl  multstore
    .type   multstore, @function
multstore:
    pushq   %rbx
    movq    %rdx, %rbx
    call    mult2
    movq    %rax, (%rbx)
    popq    %rbx
    ret
    .size   multstore, .-multstore
    .ident  "GCC: (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1"
    .section    .note.GNU-stack,"",@progbits











All of the lines beginning with ‘.’ are directives to guide the assembler and
linker. We can generally ignore these. On the other hand, there are no explanatory
remarks about what the instructions do or how they relate to the source code.

To provide a clearer presentation of assembly code, we will show it in a form
that omits most of the directives, while including line numbers and explanatory
annotations. For our example, an annotated version would appear as follows:

   void multstore(long x, long y, long *dest)
   x in %rdi, y in %rsi, dest in %rdx
1  multstore:
2    pushq   %rbx            Save %rbx
3    movq    %rdx, %rbx      Copy dest to %rbx
4    call    mult2           Call mult2(x, y)
5    movq    %rax, (%rbx)    Store result at *dest
6    popq    %rbx            Restore %rbx
7    ret                     Return

We typically show only the lines of code relevant to the point being discussed.
Each line is numbered on the left for reference and annotated on the right by a
brief description of the effect of the instruction and how it relates to the
computations of the original C code. This is a stylized version of the way
assembly-language programmers format their code.

We also provide Web asides to cover material intended for dedicated
machine-language enthusiasts. One Web aside describes IA32 machine code. Having a
background in x86-64 makes learning IA32 fairly simple. Another Web aside gives
a brief presentation of ways to incorporate assembly code into C programs. For
some applications, the programmer must drop down to assembly code to access
low-level features of the machine. One approach is to write entire functions in
assembly code and combine them with C functions during the linking stage. A
second is to use GCC's support for embedding assembly code directly within C
programs.

Aside ATT versus Intel assembly-code formats

In our presentation, we show assembly code in ATT format (named after AT&T, the
company that operated Bell Laboratories for many years), the default format for
GCC, OBJDUMP, and the other tools we will consider. Other programming tools,
including those from Microsoft as well as the documentation from Intel, show
assembly code in Intel format. The two formats differ in a number of ways. As an
example, GCC can generate code in Intel format for the multstore function using
the following command line:

linux> gcc -Og -S -masm=intel mstore.c

This gives the following assembly code:

multstore:
    push    rbx
    mov     rbx, rdx
    call    mult2
    mov     QWORD PTR [rbx], rax
    pop     rbx
    ret

We see that the Intel and ATT formats differ in the following ways:

* The Intel code omits the size designation suffixes. We see instruction push and
mov instead of pushq and movq.

* The Intel code omits the ‘%’ character in front of register names, using rbx
instead of %rbx.

* The Intel code has a different way of describing locations in memory, for
example, QWORD PTR [rbx] rather than (%rbx).

* Instructions with multiple operands list them in the reverse order. This can be
very confusing when switching between the two formats.

Although we will not be using Intel format in our presentation, you will encounter
it in documentation from Intel and Microsoft.


3.3 Data Formats 


Due to its origins as a 16-bit architecture that expanded into a 32-bit one, Intel 
uses the term ^word" to refer to a 16-bit data type. Based on this, they refer to 32- 
bit quantities as “double words," and 64-bit quantities as “quad words." Figure 3.1 
shows the x86-64 representations used for the primitive data types of C. Standard 
int values are stored as double words (32 bits). Pointers (shown here as char *) 
are stored as 8-byte quad words, as would be expected in a 64-bit machine. With 
x86-64, data type long is implemented with 64 bits, allowing a very wide range 
of values. Most of our code examples in this chapter use pointers and long data 
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Web Aside ASM:EASM = Combining assembly códe With C'programs — : : 


Although a C compiler does'a good job of convérting the computations expressed in a program into 
machine code, there are some features of a machine that cannot be accessed'by' a, C program. For 
exaniple, every time an x86-64 processor executes an-afithmetic or logical operation, it sets a 1-bit ; 
condition code flag, named PF (for “parity flag"), to 1 when the lower 8 bits in the resulting computation i 
have an even number of ones and to 0 otherwise; Computing this information in C requires at least ; 
seven shifting, masking, and EXCLUSIVE-OR operations (see Problem 2.65). Ever’ though the hardware l 
performs this computation as part of'every arithmetic or logical operation, there is no way fora C i 
„program to determine the value of the PF condition code flag. This task cari readily be performed by i 
incorporating a small number of assembly-code instructions into the program. i 
There are two ways to incorporate assembly code into C programs. First, we can.write an-entire : 
function as a separate 'assembly-code file and let the assembler and linker combine'this,with code we " 
have written in C. Second, we can use the inline assembly feature of ccc, where brief sections of assembly , 
code can be incorporated into á C prograt using the asm directive. This approach has the advantage 
that it minimizes the amount of machine-specific code. i 
Of course, including assembly code in a C program makes the code specific to a particular class of j 
machines (such as x86-64), and so it should only be used when the desjred feature, can only be accessed | 
i 
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C declaration Intel data type Assembly-code suffix Size (bytes) 
char Byte b 1 
Short Word W 2 
int Double word 1 4 
long Quad word q 8 
char * Quad word q 8. 
float Single precision s 4 
double Double precision 1 8 


Figure 3.1 Sizes of C data types in x86-64. With a 64-bit machine, pointers are 8 bytes 
long. 


types, and so they will operate on quad words. The x86-64 instruction set includes 
a full complement of instructions for bytes, words, and double words as well. 
Floating-point numbers come in two principal formats: single-precision (4- 
byte) values, corresponding to C data type float, and double-precision (8-byte) 
values, corresponding to C data type double. Microprocessors in the x86 family
historically implemented all floating-point operations with a special 80-bit
(10-byte) floating-point format (see Problem 2.86). This format can be specified in C
programs using the declaration long double. We recommend against using this
format, however. It is not portable to other classes of machines, and it is typically
not implemented with the same high-performance hardware as is the case for
single- and double-precision arithmetic.

As the table of Figure 3.1 indicates, most assembly-code instructions generated
by GCC have a single-character suffix denoting the size of the operand. For
example, the data movement instruction has four variants: movb (move byte),
movw (move word), movl (move double word), and movq (move quad word). The
suffix ‘l’ is used for double words, since 32-bit quantities are considered to be
“long words.” The assembly code uses the suffix ‘l’ to denote a 4-byte integer as
well as an 8-byte double-precision floating-point number. This causes no ambiguity,
since floating-point code involves an entirely different set of instructions and
registers.


3.4 Accessing Information 


An x86-64 central processing unit (CPU) contains a set of 16 general-purpose
registers storing 64-bit values. These registers are used to store integer data as well
as pointers. Figure 3.2 diagrams the 16 registers. Their names all begin with %r, but
otherwise follow multiple different naming conventions, owing to the historical
evolution of the instruction set. The original 8086 had eight 16-bit registers, shown
in Figure 3.2 as registers %ax through %bp. Each had a specific purpose, and hence
they were given names that reflected how they were to be used. With the extension
to IA32, these registers were expanded to 32-bit registers, labeled %eax through
%ebp. In the extension to x86-64, the original eight registers were expanded to 64
bits, labeled %rax through %rbp. In addition, eight new registers were added, and
these were given labels according to a new naming convention: %r8 through %r15.

As the nested boxes in Figure 3.2 indicate, instructions can operate on data 
of different sizes stored in the low-order bytes of the 16 registers. Byte-level 
operations can access the least significant byte, 16-bit operations can access the 
least significant 2 bytes, 32-bit operations can access the least significant 4 bytes, 
and 64-bit operations can access entire registers. 

In later sections, we will present a number of instructions for copying and 
generating 1-, 2-, 4-, and 8-byte values. When these instructions have registers as 
destinations, two conventions arise for what happens to the remaining bytes in 
the register for instructions that generate less than 8 bytes: Those that generate 1- 
or 2-byte quantities leave the remaining bytes unchanged. Those that generate 4- 
byte quantities set the upper 4 bytes of the register to zero. The latter convention 
was adopted as part of the expansion from IA32 to x86-64. 

As the annotations along the right-hand side of Figure 3.2 indicate, different
registers serve different roles in typical programs. Most unique among them is the
stack pointer, %rsp, used to indicate the end position in the run-time stack. Some
instructions specifically read and write this register. The other 15 registers have
more flexibility in their uses. A small number of instructions make specific use of
certain registers. More importantly, a set of standard programming conventions
governs how the registers are to be used for managing the stack, passing function
arguments, returning values from functions, and storing local and temporary data.
We will cover these conventions in our presentation, especially in Section 3.7,
where we describe the implementation of procedures.

64-bit    32-bit    16-bit    8-bit     Role
%rax      %eax      %ax       %al       Return value
%rbx      %ebx      %bx       %bl       Callee saved
%rcx      %ecx      %cx       %cl       4th argument
%rdx      %edx      %dx       %dl       3rd argument
%rsi      %esi      %si       %sil      2nd argument
%rdi      %edi      %di       %dil      1st argument
%rbp      %ebp      %bp       %bpl      Callee saved
%rsp      %esp      %sp       %spl      Stack pointer
%r8       %r8d      %r8w      %r8b      5th argument
%r9       %r9d      %r9w      %r9b      6th argument
%r10      %r10d     %r10w     %r10b     Caller saved
%r11      %r11d     %r11w     %r11b     Caller saved
%r12      %r12d     %r12w     %r12b     Callee saved
%r13      %r13d     %r13w     %r13b     Callee saved
%r14      %r14d     %r14w     %r14b     Callee saved
%r15      %r15d     %r15w     %r15b     Callee saved

Figure 3.2 Integer registers. The low-order portions of all 16 registers can be accessed
as byte, word (16-bit), double word (32-bit), and quad word (64-bit) quantities.


3.4.1 Operand Specifiers 


Most instructions have one or more operands specifying the source values to use
in performing an operation and the destination location into which to place the
result. x86-64 supports a number of operand forms (see Figure 3.3). Source values
can be given as constants or read from registers or memory. Results can be stored
in either registers or memory. Thus, the different operand possibilities can be
classified into three types. The first type, immediate, is for constant values. In
ATT-format assembly code, these are written with a ‘$’ followed by an integer using
standard C notation, for example, $-577 or $0x1F. Different instructions allow
different ranges of immediate values; the assembler will automatically select the
most compact way of encoding a value. The second type, register, denotes the
contents of a register, one of the sixteen 8-, 4-, 2-, or 1-byte low-order portions of
the registers for operands having 64, 32, 16, or 8 bits, respectively. In Figure 3.3,
we use the notation r_a to denote an arbitrary register a and indicate its value with
the reference R[r_a], viewing the set of registers as an array R indexed by register
identifiers.

Type         Form               Operand value                  Name
Immediate    $Imm               Imm                            Immediate
Register     r_a                R[r_a]                         Register
Memory       Imm                M[Imm]                         Absolute
Memory       (r_a)              M[R[r_a]]                      Indirect
Memory       Imm(r_b)           M[Imm + R[r_b]]                Base + displacement
Memory       (r_b,r_i)          M[R[r_b] + R[r_i]]             Indexed
Memory       Imm(r_b,r_i)       M[Imm + R[r_b] + R[r_i]]       Indexed
Memory       (,r_i,s)           M[R[r_i]*s]                    Scaled indexed
Memory       Imm(,r_i,s)        M[Imm + R[r_i]*s]              Scaled indexed
Memory       (r_b,r_i,s)        M[R[r_b] + R[r_i]*s]           Scaled indexed
Memory       Imm(r_b,r_i,s)     M[Imm + R[r_b] + R[r_i]*s]     Scaled indexed

Figure 3.3 Operand forms. Operands can denote immediate (constant) values, register
values, or values from memory. The scaling factor s must be either 1, 2, 4, or 8.

The third type of operand is a memory reference, in which we access some
memory location according to a computed address, often called the effective
address. Since we view the memory as a large array of bytes, we use the notation
M_b[Addr] to denote a reference to the b-byte value stored in memory starting at
address Addr. To simplify things, we will generally drop the subscript b.

As Figure 3.3 shows, there are many different addressing modes allowing different
forms of memory references. The most general form is shown at the bottom
of the table with syntax Imm(r_b,r_i,s). Such a reference has four components: an
immediate offset Imm, a base register r_b, an index register r_i, and a scale factor
s, where s must be 1, 2, 4, or 8. Both the base and index must be 64-bit registers.
The effective address is computed as Imm + R[r_b] + R[r_i]*s. This general form is
often seen when referencing elements of arrays. The other forms are simply special
cases of this general form where some of the components are omitted. As we
will see, the more complex addressing modes are useful when referencing array
and structure elements.
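The address arithmetic of the most general operand form can be sketched in C as follows. This is only an illustrative model: the function name effective_address and its parameter names are assumptions introduced here, and the register contents are passed in as plain integers.

```c
#include <stdint.h>

/* Hypothetical model of the general form Imm(rb, ri, s):
   effective address = Imm + R[rb] + R[ri] * s, with s in {1, 2, 4, 8}.
   Stores the address in *addr and returns 0, or returns -1 if the
   scale factor is not one of the allowed values. */
int effective_address(int64_t imm, uint64_t base, uint64_t index,
                      unsigned scale, uint64_t *addr)
{
    if (scale != 1 && scale != 2 && scale != 4 && scale != 8)
        return -1;
    /* Unsigned arithmetic wraps modulo 2^64, as address computation does. */
    *addr = (uint64_t)imm + base + index * scale;
    return 0;
}
```

For instance, with Imm = 16, a base value of 0x1000, an index value of 2, and s = 8, the sketch yields 0x1000 + 16 + 2*8 = 0x1020. The simpler forms in Figure 3.3 correspond to passing 0 for the omitted components and 1 for an omitted scale.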


Practice Problem 3.1
Assume the following values are stored at the indicated memory addresses and
registers:

Address    Value    Register    Value
0x100      0xFF     %rax        0x100
0x104      0xAB     %rcx        0x1
0x108      0x13     %rdx        0x3
0x10C      0x11

Fill in the following table showing the values for the indicated operands:

Operand               Value
%rax                  ________
0x104                 ________
$0x108                ________
(%rax)                ________
4(%rax)               ________
9(%rax,%rdx)          ________
260(%rcx,%rdx)        ________
0xFC(,%rcx,4)         ________
(%rax,%rdx,4)         ________

3.4.2 Data Movement Instructions 


Among the most heavily used instructions are those that copy data from one
location to another. The generality of the operand notation allows a simple data
movement instruction to express a range of possibilities that in many machines
would require a number of different instructions. We present a number of different
data movement instructions, differing in their source and destination types,
what conversions they perform, and other side effects they may have. In our
presentation, we group the many different instructions into instruction classes, where
the instructions in a class perform the same operation but with different operand
sizes.

Figure 3.4 lists the simplest form of data movement instructions, the MOV class.
These instructions copy data from a source location to a destination location,
without any transformation. The class consists of four instructions: movb, movw,
movl, and movq. All four of these instructions have similar effects; they differ
primarily in that they operate on data of different sizes: 1, 2, 4, and 8 bytes,
respectively.
Instruction       Effect     Description
MOV S, D          D ← S      Move
  movb                       Move byte
  movw                       Move word
  movl                       Move double word
  movq                       Move quad word
movabsq I, R      R ← I      Move absolute quad word

Figure 3.4 Simple data movement instructions.


The source operand designates a value that is immediate, stored in a register,
or stored in memory. The destination operand designates a location that is either a
register or a memory address. x86-64 imposes the restriction that a move instruction
cannot have both operands refer to memory locations. Copying a value from
one memory location to another requires two instructions: the first to load the
source value into a register, and the second to write this register value to the
destination. Referring to Figure 3.2, register operands for these instructions can be
the labeled portions of any of the 16 registers, where the size of the register must
match the size designated by the last character of the instruction (‘b’, ‘w’, ‘l’, or
‘q’). For most cases, the MOV instructions will only update the specific register bytes
or memory locations indicated by the destination operand. The only exception is
that when movl has a register as the destination, it will also set the high-order 4
bytes of the register to 0. This exception arises from the convention, adopted in
x86-64, that any instruction that generates a 32-bit value for a register also sets the
high-order portion of the register to 0.
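The effect of each MOV variant on a 64-bit destination register can be modeled with a few lines of C. This is a sketch under stated assumptions: the names write_b, write_w, write_l, and write_q are hypothetical, and the register is represented as a plain uint64_t value that each function returns in updated form.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit register updated by the MOV class.
   Each function takes the old register contents and the value moved in,
   and returns the new register contents. */

/* movb: replace the low-order byte, leave the other 7 bytes unchanged. */
uint64_t write_b(uint64_t reg, uint8_t v)
{
    return (reg & ~UINT64_C(0xFF)) | v;
}

/* movw: replace the low-order 2 bytes, leave the other 6 bytes unchanged. */
uint64_t write_w(uint64_t reg, uint16_t v)
{
    return (reg & ~UINT64_C(0xFFFF)) | v;
}

/* movl: replace the low-order 4 bytes and set the high-order 4 bytes to 0. */
uint64_t write_l(uint64_t reg, uint32_t v)
{
    (void)reg;              /* old contents do not survive */
    return (uint64_t)v;
}

/* movq: replace the entire register. */
uint64_t write_q(uint64_t reg, uint64_t v)
{
    (void)reg;
    return v;
}
```

Starting from 0x0011223344556677, writing 0xFF with write_b gives 0x00112233445566FF, while writing 0xFFFFFFFF with write_l gives 0x00000000FFFFFFFF; only the 4-byte case clears the upper half.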

The following MOV instruction examples show the five possible combinations
of source and destination types. Recall that the source operand comes first and
the destination second.

    movl $0x4050,%eax        Immediate--Register, 4 bytes
    movw %bp,%sp             Register--Register,  2 bytes
    movb (%rdi,%rcx),%al     Memory--Register,    1 byte
    movb $-17,(%rsp)         Immediate--Memory,   1 byte
    movq %rax,-12(%rbp)      Register--Memory,    8 bytes


A final instruction documented in Figure 3.4 is for dealing with 64-bit imme- 
diate data. The regular movq instruction can only have immediate source operands 
that can be represented as 32-bit two's-complement numbers. This value is then 
sign extended to produce the 64-bit value for the destination. The movabsq in- 
struction can have an arbitrary 64-bit immediate value as its source operand and 
can only have a register as a destination. 
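The difference between the two forms can be illustrated in C. The sketch below is not from the text: movq_imm32 and fits_movq_imm are hypothetical names, the first modeling the sign extension applied to a 32-bit immediate and the second checking whether a given 64-bit constant could be encoded that way.

```c
#include <stdint.h>

/* Value a 64-bit destination receives from a regular movq with an
   immediate source: the 32-bit immediate is sign extended to 64 bits. */
int64_t movq_imm32(int32_t imm)
{
    return (int64_t)imm;
}

/* Hypothetical check: returns 1 if value can be represented as a 32-bit
   two's-complement number (so a regular movq immediate suffices), and
   0 if an arbitrary 64-bit immediate (movabsq) would be needed. */
int fits_movq_imm(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}
```

For example, -1 fits in 32 bits and sign extends to a 64-bit pattern of all ones, whereas a constant such as 0x0011223344556677 does not fit and would need the movabsq form.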

Figures 3.5 and 3.6 document two classes of data movement instructions for 
use when copying a smaller source value to a larger destination. All of these 
instructions copy data from a source, which can be either a register or stored 
Aside  Understanding how data movement changes a destination register

As described, there are two different conventions regarding whether and how data movement
instructions modify the upper bytes of a destination register. This distinction is illustrated by the
following code sequence:

1   movabsq $0x0011223344556677, %rax     %rax = 0011223344556677
2   movb    $-1, %al                      %rax = 00112233445566FF
3   movw    $-1, %ax                      %rax = 001122334455FFFF
4   movl    $-1, %eax                     %rax = 00000000FFFFFFFF
5   movq    $-1, %rax                     %rax = FFFFFFFFFFFFFFFF

In the following discussion, we use hexadecimal notation. In the example, the instruction on line 1
initializes register %rax to the pattern 0011223344556677. The remaining instructions have immediate
value -1 as their source values. Recall that the hexadecimal representation of -1 is of the form FF...F,
where the number of F's is twice the number of bytes in the representation. The movb instruction (line 2)
therefore sets the low-order byte of %rax to FF, while the movw instruction (line 3) sets the low-order
2 bytes to FFFF, with the remaining bytes unchanged. The movl instruction (line 4) sets the low-order
4 bytes to FFFFFFFF, but it also sets the high-order 4 bytes to 00000000. Finally, the movq instruction
(line 5) sets the complete register to FFFFFFFFFFFFFFFF.






Instruction     Effect                 Description
MOVZ S, R       R ← ZeroExtend(S)      Move with zero extension
  movzbw                               Move zero-extended byte to word
  movzbl                               Move zero-extended byte to double word
  movzwl                               Move zero-extended word to double word
  movzbq                               Move zero-extended byte to quad word
  movzwq                               Move zero-extended word to quad word

Figure 3.5 Zero-extending data movement instructions. These instructions have a
register or memory location as the source and a register as the destination.


in memory, to a register destination. Instructions in the MOVZ class fill out the
remaining bytes of the destination with zeros, while those in the MOVS class fill
them out by sign extension, replicating copies of the most significant bit of the
source operand. Observe that each instruction name has size designators as its
final two characters: the first specifying the source size, and the second specifying
the destination size. As can be seen, the instructions in these classes cover
1- and 2-byte source sizes (and, for MOVS, 4-byte sources) with 2-, 4-, and 8-byte
destination sizes, considering only cases where the destination is larger than the
source, of course.
Instruction     Effect                      Description
MOVS S, R       R ← SignExtend(S)           Move with sign extension
  movsbw                                    Move sign-extended byte to word
  movsbl                                    Move sign-extended byte to double word
  movswl                                    Move sign-extended word to double word
  movsbq                                    Move sign-extended byte to quad word
  movswq                                    Move sign-extended word to quad word
  movslq                                    Move sign-extended double word to quad word
cltq            %rax ← SignExtend(%eax)     Sign-extend %eax to %rax

Figure 3.6 Sign-extending data movement instructions. The MOVS instructions have
a register or memory location as the source and a register as the destination. The cltq
instruction is specific to registers %eax and %rax.


Note the absence of an explicit instruction to zero-extend a 4-byte source
value to an 8-byte destination in Figure 3.5. Such an instruction would logically
be named movzlq, but this instruction does not exist. Instead, this type of data
movement can be implemented using a movl instruction having a register as the
destination. This technique takes advantage of the property that an instruction
generating a 4-byte value with a register as the destination will fill the upper 4
bytes with zeros. Otherwise, for 64-bit destinations, moving with sign extension is
supported for all three source types, and moving with zero extension is supported
for the two smaller source types.

Figure 3.6 also documents the cltq instruction. This instruction has no
operands: it always uses register %eax as its source and %rax as the destination for
the sign-extended result. It therefore has the exact same effect as the instruction
movslq %eax,%rax, but it has a more compact encoding.
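In C, these extensions correspond to ordinary integer conversions from a narrower type to a wider one. The sketch below uses hypothetical function names, and the instruction named in each comment is only the one the text associates with that kind of conversion; a compiler is free to generate something else.

```c
#include <stdint.h>

/* Signed source: the upper bytes are filled with copies of the sign bit. */
int64_t sign_extend_byte(int8_t b)   { return (int64_t)b; }   /* like movsbq       */
int64_t sign_extend_long(int32_t v)  { return (int64_t)v; }   /* like movslq, cltq */

/* Unsigned source: the upper bytes are filled with zeros. */
uint64_t zero_extend_byte(uint8_t b)  { return (uint64_t)b; } /* like movzbq       */
uint64_t zero_extend_long(uint32_t v) { return (uint64_t)v; } /* like movl to a
                                                                 register          */
```

The byte pattern AA, for example, becomes FFFFFFFFFFFFFFAA when treated as a signed byte (value -86) and 00000000000000AA when treated as an unsigned byte (value 170).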





Practice Problem 3.2
For each of the following lines of assembly language, determine the appropriate
instruction suffix based on the operands. (For example, mov can be rewritten as
movb, movw, movl, or movq.)

    mov_    %eax, (%rsp)
    mov_    (%rax), %dx
    mov_    $0xFF, %bl
    mov_    (%rsp,%rdx,4), %dl
    mov_    (%rdx), %rax
    mov_    %dx, (%rax)
Aside  Comparing byte movement instructions

The following example illustrates how different data movement instructions either do or do not change
the high-order bytes of the destination. Observe that the three byte-movement instructions movb,
movsbq, and movzbq differ from each other in subtle ways. Here is an example:

1   movabsq $0x0011223344556677, %rax     %rax = 0011223344556677
2   movb    $0xAA, %dl                    %dl  = AA
3   movb    %dl,%al                       %rax = 00112233445566AA
4   movsbq  %dl,%rax                      %rax = FFFFFFFFFFFFFFAA
5   movzbq  %dl,%rax                      %rax = 00000000000000AA

In the following discussion, we use hexadecimal notation for all of the values. The first two lines
of the code initialize registers %rax and %dl to 0011223344556677 and AA, respectively. The remaining
instructions all copy the low-order byte of %rdx to the low-order byte of %rax. The movb instruction
(line 3) does not change the other bytes. The movsbq instruction (line 4) sets the other 7 bytes to
either all ones or all zeros depending on the high-order bit of the source byte. Since hexadecimal A
represents binary value 1010, sign extension causes the higher-order bytes to each be set to FF. The
movzbq instruction (line 5) always sets the other 7 bytes to zero.
Practice Problem 3.3 (solution page 326)
Each of the following lines of code generates an error message when we invoke
the assembler. Explain what is wrong with each line.

    movb $0xF, (%ebx)
    movl %rax, (%rsp)
    movw (%rax),4(%rsp)
    movb %al,%sl
    movq %rax,$0x123
    movl %eax,%rdx
    movb %si, 8(%rbp)

3.4.3 Data Movement Example 

As an example of code that uses data movement instructions, consider the data
exchange routine shown in Figure 3.7, both as C code and as assembly code
generated by GCC.

As Figure 3.7(b) shows, function exchange is implemented with just three
instructions: two data movements (movq) plus an instruction to return back to
the point from which the function was called (ret). We will cover the details of
function call and return in Section 3.7. Until then, it suffices to say that arguments
are passed to functions in registers. Our annotated assembly code documents
these. A function returns a value by storing it in register %rax, or in one of the
low-order portions of this register.
(a) C code

long exchange(long *xp, long y)
{
    long x = *xp;
    *xp = y;
    return x;
}

(b) Assembly code

    long exchange(long *xp, long y)
    xp in %rdi, y in %rsi
1   exchange:
2     movq    (%rdi), %rax      Get x at xp. Set as return value.
3     movq    %rsi, (%rdi)      Store y at xp.
4     ret                       Return.

Figure 3.7 C and assembly code for exchange routine. Registers %rdi and %rsi
hold parameters xp and y, respectively.


When the procedure begins execution, procedure parameters xp and y are
stored in registers %rdi and %rsi, respectively. Instruction 2 then reads x from
memory and stores the value in register %rax, a direct implementation of the
operation x = *xp in the C program. Later, register %rax will be used to return
a value from the function, and so the return value will be x. Instruction 3 writes y
to the memory location designated by xp in register %rdi, a direct implementation
of the operation *xp = y. This example illustrates how the MOV instructions can be
used to read from memory to a register (line 2), and to write from a register to
memory (line 3).

Two features about this assembly code are worth noting. First, we see that what
we call “pointers” in C are simply addresses. Dereferencing a pointer involves
copying that pointer into a register, and then using this register in a memory
reference. Second, local variables such as x are often kept in registers rather than
stored in memory locations. Register access is much faster than memory access.










Practice Problem 3.4
Assume variables sp and dp are declared with types

    src_t *sp;
    dest_t *dp;

where src_t and dest_t are data types declared with typedef. We wish to use
the appropriate pair of data movement instructions to implement the operation

    *dp = (dest_t) *sp;
New to C?  Some examples of pointers

Function exchange (Figure 3.7(a)) provides a good illustration of the use of pointers in C. Argument
xp is a pointer to a long integer, while y is a long integer itself. The statement

    long x = *xp;

indicates that we should read the value stored in the location designated by xp and store it as a local
variable named x. This read operation is known as pointer dereferencing. The C operator ‘*’ performs
pointer dereferencing.

The statement

    *xp = y;

does the reverse: it writes the value of parameter y at the location designated by xp. This is also a form
of pointer dereferencing (and hence the operator *), but it indicates a write operation since it is on the
left-hand side of the assignment.

The following is an example of exchange in action:

    long a = 4;
    long b = exchange(&a, 3);
    printf("a = %ld, b = %ld\n", a, b);

This code will print

    a = 3, b = 4

The C operator ‘&’ (called the “address of” operator) creates a pointer, in this case to the location
holding local variable a. Function exchange overwrites the value stored in a with 3 but returns the
previous value, 4, as the function value. Observe how by passing a pointer to exchange, it could modify
data held at some remote location.



Assume that the values of sp and dp are stored in registers %rdi and %rsi,
respectively. For each entry in the table, show the two instructions that implement
the specified data movement. The first instruction in the sequence should read
from memory, do the appropriate conversion, and set the appropriate portion of
register %rax. The second instruction should then write the appropriate portion
of %rax to memory. In both cases, the portions may be %rax, %eax, %ax, or %al,
and they may differ from one another.

Recall that when performing a cast that involves both a size change and a 
change of “signedness” in C, the operation should change the size first (Section 
2.2.6). 


src_t            dest_t           Instruction
long             long             movq (%rdi), %rax
                                  movq %rax, (%rsi)
char             int              ________
                                  ________
char             unsigned         ________
                                  ________
unsigned char    long             ________
                                  ________
int              char             ________
                                  ________
unsigned         unsigned char    ________
                                  ________
char             short            ________
                                  ________





Practice Problem 3.5
You are given the following information. A function with prototype

    void decode1(long *xp, long *yp, long *zp);

is compiled into assembly code, yielding the following:

    void decode1(long *xp, long *yp, long *zp)
    xp in %rdi, yp in %rsi, zp in %rdx

decode1:
    movq    (%rdi), %r8
    movq    (%rsi), %rcx
    movq    (%rdx), %rax
    movq    %r8, (%rsi)
    movq    %rcx, (%rdx)
    movq    %rax, (%rdi)
    ret

Parameters xp, yp, and zp are stored in registers %rdi, %rsi, and %rdx, respectively.

Write C code for decode1 that will have an effect equivalent to the assembly
code shown.


3.4.4 Pushing and Popping Stack Data 


The final two data movement operations are used to push data onto and pop data
from the program stack, as documented in Figure 3.8. As we will see, the stack
plays a vital role in the handling of procedure calls. By way of background, a stack
is a data structure where values can be added or deleted, but only according to
a “last-in, first-out” discipline. We add data to a stack via a push operation and
remove it via a pop operation, with the property that the value popped will always
be the value that was most recently pushed and is still on the stack. A stack can be
implemented as an array, where we always insert and remove elements from one
Instruction     Effect                          Description
pushq S         R[%rsp] ← R[%rsp] - 8;          Push quad word
                M[R[%rsp]] ← S
popq D          D ← M[R[%rsp]];                 Pop quad word
                R[%rsp] ← R[%rsp] + 8

Figure 3.8 Push and pop instructions.

[Figure 3.9 diagram: three panels labeled "Initially", "pushq %rax", and "popq %rdx",
with addresses increasing upward and the stack "top" at the bottom of each panel.]

Figure 3.9 Illustration of stack operation. By convention, we draw stacks upside down,
so that the “top” of the stack is shown at the bottom. With x86-64, stacks grow toward
lower addresses, so pushing involves decrementing the stack pointer (register %rsp) and
storing to memory, while popping involves reading from memory and incrementing the
stack pointer.


end of the array. This end is called the top of the stack. With x86-64, the program
stack is stored in some region of memory. As illustrated in Figure 3.9, the stack
grows downward such that the top element of the stack has the lowest address of
all stack elements. (By convention, we draw stacks upside down, with the stack
“top” shown at the bottom of the figure.) The stack pointer %rsp holds the address
of the top stack element.
The pushq instruction provides the ability to push data onto the stack, while
the popq instruction pops it. Each of these instructions takes a single operand: the
data source for pushing and the data destination for popping.
Pushing a quad word value onto the stack involves first decrementing the
stack pointer by 8 and then writing the value at the new top-of-stack address.
Therefore, the behavior of the instruction pushq %rbp is equivalent to that of the
pair of instructions

    subq $8,%rsp            Decrement stack pointer
    movq %rbp,(%rsp)        Store %rbp on stack

except that the pushq instruction is encoded in the machine code as a single byte,
whereas the pair of instructions shown above requires a total of 8 bytes. The first
two columns in Figure 3.9 illustrate the effect of executing the instruction pushq
%rax when %rsp is 0x108 and %rax is 0x123. First %rsp is decremented by 8, giving
0x100, and then 0x123 is stored at memory address 0x100.

Popping a quad word involves reading from the top-of-stack location and
then incrementing the stack pointer by 8. Therefore, the instruction popq %rax
is equivalent to the following pair of instructions:

    movq (%rsp),%rax        Read %rax from stack
    addq $8,%rsp            Increment stack pointer

The third column of Figure 3.9 illustrates the effect of executing the instruction
popq %rdx immediately after executing the pushq. Value 0x123 is read from
memory and written to register %rdx. Register %rsp is incremented back to 0x108.
As shown in the figure, the value 0x123 remains at memory location 0x100 until it
is overwritten (e.g., by another push operation). However, the stack top is always
considered to be the address indicated by %rsp.

Since the stack is contained in the same memory as the program code and 
other forms of program data, programs can access arbitrary positions within the 
stack using the standard memory addressing methods. For example, assuming the 
topmost element of the stack is a quad word, the instruction movq 8(%rsp),%rdx
will copy the second quad word from the stack to register %rdx. 
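The push and pop behavior described in this section can be mimicked with a toy model in C. Everything here is an assumption made for illustration: the type machine_t, the memory size, and the function names pushq, popq, and peek are hypothetical, addresses are simply offsets into a byte array, and no bounds checking is done.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical toy machine: a small byte-addressed memory and a stack
   pointer. Addresses are offsets into mem[]. */
enum { MEM_SIZE = 0x200 };

typedef struct {
    uint8_t  mem[MEM_SIZE];
    uint64_t rsp;
} machine_t;

/* Model of pushq S: decrement the stack pointer, then store. */
void pushq(machine_t *m, uint64_t src)
{
    m->rsp -= 8;                                /* R[%rsp] <- R[%rsp] - 8 */
    memcpy(&m->mem[m->rsp], &src, sizeof src);  /* M[R[%rsp]] <- S        */
}

/* Model of popq D: read the top element, then increment the stack pointer. */
uint64_t popq(machine_t *m)
{
    uint64_t dst;
    memcpy(&dst, &m->mem[m->rsp], sizeof dst);  /* D <- M[R[%rsp]]        */
    m->rsp += 8;                                /* R[%rsp] <- R[%rsp] + 8 */
    return dst;
}

/* Model of movq offset(%rsp),D: read a quad word without popping. */
uint64_t peek(const machine_t *m, uint64_t offset)
{
    uint64_t v;
    memcpy(&v, &m->mem[m->rsp + offset], sizeof v);
    return v;
}
```

With rsp initially 0x108, pushing 0x123 leaves rsp at 0x100 with the value stored there; popping returns 0x123 and restores rsp to 0x108, while the bytes at 0x100 keep their old contents until overwritten. After two pushes, peek(&m, 8) reads the second quad word from the top.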


3.5 Arithmetic and Logical Operations


Figure 3.10 lists some of the x86-64 integer and logic operations. Most of the 
operations are given as instruction classes, as they can have different variants with 
different operand sizes. (Only leaq has no other size variants.) For example, the 
instruction class ADD consists of four addition instructions: addb, addw, addl, and
addq, adding bytes, words, double words, and quad words, respectively. Indeed,
each of the instruction classes shown has instructions for operating on these four
different sizes of data. The operations are divided into four groups: load effective
address, unary, binary, and shifts. Binary operations have two operands, while
unary operations have one operand. These operands are specified using the same 
notation as described in Section 3.4. 


3.5.1 Load Effective Address 


The load effective address instruction leaq is actually a variant of the movq in- 
struction. It has the form of an instruction that reads from memory to a register, 
Instruction     Effect           Description
leaq S, D       D ← &S           Load effective address

INC D           D ← D + 1        Increment
DEC D           D ← D - 1        Decrement
NEG D           D ← -D           Negate
NOT D           D ← ~D           Complement

ADD S, D        D ← D + S        Add
SUB S, D        D ← D - S        Subtract
IMUL S, D       D ← D * S        Multiply
XOR S, D        D ← D ^ S        Exclusive-or
OR S, D         D ← D | S        Or
AND S, D        D ← D & S        And

SAL k, D        D ← D << k       Left shift
SHL k, D        D ← D << k       Left shift (same as SAL)
SAR k, D        D ← D >>_A k     Arithmetic right shift
SHR k, D        D ← D >>_L k     Logical right shift

Figure 3.10 Integer arithmetic operations. The load effective address (leaq)
instruction is commonly used to perform simple arithmetic. The remaining ones are
more standard unary or binary operations. We use the notation >>_A and >>_L to denote
arithmetic and logical right shift, respectively. Note the nonintuitive ordering of the
operands with ATT-format assembly code.


but it does not reference memory at all. Its first operand appears to be a memory
reference, but instead of reading from the designated location, the instruction
copies the effective address to the destination. We indicate this computation in
Figure 3.10 using the C address operator &S. This instruction can be used to generate
pointers for later memory references. In addition, it can be used to compactly
describe common arithmetic operations. For example, if register %rdx contains
value x, then the instruction leaq 7(%rdx,%rdx,4),%rax will set register %rax
to 5x + 7. Compilers often find clever uses of leaq that have nothing to do with
effective address computations. The destination operand must be a register.


Practice Problem 3.6
Suppose register %rax holds value x and %rcx holds value y. Fill in the table below
with formulas indicating the value that will be stored in register %rdx for each of
the given assembly-code instructions:

Instruction                     Result
leaq 6(%rax), %rdx              ________
leaq (%rax,%rcx), %rdx          ________
leaq (%rax,%rcx,4), %rdx        ________
leaq 7(%rax,%rax,8), %rdx       ________
leaq 0xA(,%rcx,4), %rdx         ________
leaq 9(%rax,%rcx,2), %rdx       ________



As an illustration of the use of leaq in compiled code, consider the following
C program:

long scale(long x, long y, long z) {
    long t = x + 4 * y + 12 * z;
    return t;
}

When compiled, the arithmetic operations of the function are implemented
by a sequence of three leaq instructions, as is documented by the comments on the
right-hand side:


    long scale(long x, long y, long z)
    x in %rdi, y in %rsi, z in %rdx

scale:
    leaq    (%rdi,%rsi,4), %rax     x + 4*y
    leaq    (%rdx,%rdx,2), %rdx     z + 2*z = 3*z
    leaq    (%rax,%rdx,4), %rax     (x+4*y) + 4*(3*z) = x + 4*y + 12*z
    ret

The ability of the leaq instruction to perform addition and limited forms of
multiplication proves useful when compiling simple arithmetic expressions such
as this example.
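The three-step decomposition can be checked in C by modeling each leaq as a small helper. The helper lea and the function scale_lea are hypothetical names introduced for this sketch; scale is the function from the text.

```c
/* Hypothetical helper modeling leaq Imm(rb, ri, s), D:
   returns Imm + b + i*s, where b and i are the register values. */
static long lea(long imm, long b, long i, long s)
{
    return imm + b + i * s;
}

/* The function as written in C. */
long scale(long x, long y, long z)
{
    long t = x + 4 * y + 12 * z;
    return t;
}

/* The same computation written as the three leaq steps. */
long scale_lea(long x, long y, long z)
{
    long rax = lea(0, x, y, 4);    /* x + 4*y                  */
    long rdx = lea(0, z, z, 2);    /* z + 2*z = 3*z            */
    rax = lea(0, rax, rdx, 4);     /* (x + 4*y) + 4*(3*z)      */
    return rax;
}
```

For x = 1, y = 2, z = 3, both versions give 1 + 8 + 36 = 45. The same helper reproduces the earlier 5x + 7 example: lea(7, x, x, 4) with x = 3 gives 22.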





Practice Problem 3.7

Consider the following code, in which we have omitted the expression being
computed:

long scale2(long x, long y, long z) {
    long t = ________;
    return t;
}

Compiling the actual function with gcc yields the following assembly code:

  long scale2(long x, long y, long z)
  x in %rdi, y in %rsi, z in %rdx
scale2:
  leaq (%rdi,%rdi,4), %rax
  leaq (%rax,%rsi,2), %rax
  leaq (%rax,%rdx,8), %rax
  ret

Fill in the missing expression in the C code.


3.5.2 Unary and Binary Operations


Operations in the second group are unary operations, with the single operand
serving as both source and destination. This operand can be either a register or
a memory location. For example, the instruction incq (%rsp) causes the 8-byte
element on the top of the stack to be incremented. This syntax is reminiscent of
the C increment (++) and decrement (--) operators.

The third group consists of binary operations, where the second operand
is used as both a source and a destination. This syntax is reminiscent of the C
assignment operators, such as x -= y. Observe, however, that the source operand
is given first and the destination second. This looks peculiar for noncommutative
operations. For example, the instruction subq %rax,%rdx decrements register
%rdx by the value in %rax. (It helps to read the instruction as "Subtract %rax from
%rdx.") The first operand can be either an immediate value, a register, or a memory
location. The second can be either a register or a memory location. As with the
MOV instructions, the two operands cannot both be memory locations. Note that
when the second operand is a memory location, the processor must read the value
from memory, perform the operation, and then write the result back to memory.
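The operand order and the read-modify-write behavior can be sketched in C as
follows. The function name subq_model is hypothetical; passing the destination
by pointer stands in for a destination that may be a memory location.

```c
#include <stdint.h>

/* Hypothetical model of subq src, dst. The destination is the SECOND
 * operand, so the effect is dst = dst - src, as in the C statement
 * dst -= src. */
void subq_model(uint64_t src, uint64_t *dst)
{
    uint64_t old = *dst;          /* read the current destination value */
    uint64_t result = old - src;  /* perform the operation */
    *dst = result;                /* write the result back */
}
```

With a variable rdx holding 10, the call subq_model(3, &rdx) leaves 7 in rdx,
matching the reading "subtract the first operand from the second."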


Practice Problem 3.8

Assume the following values are stored at the indicated memory addresses and
registers:

Address   Value      Register   Value
0x100     0xFF       %rax       0x100
0x108     0xAB       %rcx       0x1
0x110     0x13       %rdx       0x3
0x118     0x11

Fill in the following table showing the effects of the following instructions,
in terms of both the register or memory location that will be updated and the
resulting value:

Instruction                  Destination   Value
addq %rcx,(%rax)             ________      ________
subq %rdx,8(%rax)            ________      ________
imulq $16,(%rax,%rdx,8)      ________      ________
incq 16(%rax)                ________      ________
decq %rcx                    ________      ________
subq %rdx,%rax               ________      ________








3.5.3 Shift Operations

The final group consists of shift operations, where the shift amount is given first
and the value to shift is given second. Both arithmetic and logical right shifts are
possible. The different shift instructions can specify the shift amount either as
an immediate value or with the single-byte register %cl. (These instructions are
unusual in only allowing this specific register as the operand.) In principle, having
a 1-byte shift amount would make it possible to encode shift amounts ranging up
to $2^8 - 1 = 255$. With x86-64, a shift instruction operating on data values that are
w bits long determines the shift amount from the low-order m bits of register
%cl, where $2^m = w$. The higher-order bits are ignored. So, for example, when
register %cl has hexadecimal value 0xFF, then instruction salb would shift by
7, while salw would shift by 15, sall would shift by 31, and salq would shift
by 63.
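For the 32-bit and 64-bit cases, the effect of ignoring the high-order bits of
%cl can be sketched in C by masking the count explicitly. The function names
below are illustrative assumptions, and the sketch deliberately covers only the
sall and salq cases.

```c
#include <stdint.h>

/* Hypothetical model of sall %cl, x: with w = 32, only the low-order
 * 5 bits of the count are used (2^5 = 32), i.e. count & 31. */
uint32_t sall_model(uint32_t x, uint8_t cl)
{
    return x << (cl & 31);
}

/* Hypothetical model of salq %cl, x: with w = 64, only the low-order
 * 6 bits of the count are used (2^6 = 64), i.e. count & 63. */
uint64_t salq_model(uint64_t x, uint8_t cl)
{
    return x << (cl & 63);
}
```

The explicit mask also keeps the C code well defined, since shifting by an
amount greater than or equal to the operand width is undefined behavior in C.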

As Figure 3.10 indicates, there are two names for the left shift instruction: SAL
and SHL. Both have the same effect, filling from the right with zeros. The right
shift instructions differ in that SAR performs an arithmetic shift (fill with copies of
the sign bit), whereas SHR performs a logical shift (fill with zeros). The destination
operand of a shift operation can be either a register or a memory location. We
denote the two different right shift operations in Figure 3.10 as >>_A (arithmetic)
and >>_L (logical).
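The difference between the two right shifts can be made concrete with a small
C sketch. The names shr_model and sar_model are assumptions for this example.
The arithmetic shift is written with unsigned operations only, because the
result of >> on a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Logical right shift (SHR): fill from the left with zeros.
 * Assumes 0 <= k <= 63. */
uint64_t shr_model(uint64_t x, unsigned k)
{
    return x >> k;
}

/* Arithmetic right shift (SAR): fill from the left with copies of the
 * sign bit (bit 63). Assumes 0 <= k <= 63. */
uint64_t sar_model(uint64_t x, unsigned k)
{
    uint64_t shifted = x >> k;
    /* Mask with ones in the k vacated high-order positions. */
    uint64_t fill = ~(~(uint64_t) 0 >> k);
    return (x >> 63) ? (shifted | fill) : shifted;
}
```

For a bit pattern with the sign bit clear the two functions agree; they differ
only when the most significant bit is 1.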





Practice Problem 3.9

Suppose we want to generate assembly code for the following C function:

long shift_left4_rightn(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}

The code that follows is a portion of the assembly code that performs the
actual shifts and leaves the final value in register %rax. Two key instructions
have been omitted. Parameters x and n are stored in registers %rdi and %rsi,
respectively.

  long shift_left4_rightn(long x, long n)
  x in %rdi, n in %rsi
shift_left4_rightn:
  movq %rdi, %rax      Get x
  ________             x <<= 4
  movl %esi, %ecx      Get n (4 bytes)
  ________             x >>= n

Fill in the missing instructions, following the annotations on the right. The
right shift should be performed arithmetically.

(a) C code

long arith(long x, long y, long z)
{
    long t1 = x ^ y;
    long t2 = z * 48;
    long t3 = t1 & 0x0F0F0F0F;
    long t4 = t2 - t3;
    return t4;
}

(b) Assembly code

    long arith(long x, long y, long z)
    x in %rdi, y in %rsi, z in %rdx
1   arith:
2     xorq %rsi, %rdi              t1 = x ^ y
3     leaq (%rdx,%rdx,2), %rax     3*z
4     salq $4, %rax                t2 = 16 * (3*z) = 48*z
5     andl $252645135, %edi        t3 = t1 & 0x0F0F0F0F
6     subq %rdi, %rax              Return t2 - t3
7     ret

Figure 3.11 C and assembly code for arithmetic function.


3.5.4 Discussion 


We see that most of the instructions shown in Figure 3.10 can be used for either
unsigned or two's-complement arithmetic. Only right shifting requires instructions
that differentiate between signed versus unsigned data. This is one of the features
that makes two's-complement arithmetic the preferred way to implement signed
integer arithmetic.

Figure 3.11 shows an example of a function that performs arithmetic opera-
tions and its translation into assembly code. Arguments x, y, and z are initially
stored in registers %rdi, %rsi, and %rdx, respectively. The assembly-code instruc-
tions correspond closely with the lines of C source code. Line 2 computes the value
of x^y. Lines 3 and 4 compute the expression z*48 by a combination of leaq and
shift instructions. Line 5 computes the AND of t1 and 0x0F0F0F0F. The final sub-
traction is computed by line 6. Since the destination of the subtraction is register
%rax, this will be the value returned by the function.

In the assembly code of Figure 3.11, the sequence of values in register %rax
corresponds to program values 3*z, z*48, and t4 (as the return value). In general,
compilers generate code that uses individual registers for multiple program values
and moves program values among the registers.
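The correspondence between the C source and the register-level steps can be
sketched by replaying the instructions of Figure 3.11 in C. The function
arith_by_steps and its register-named variables are illustrative, and the sketch
assumes a platform where long is 64 bits, as on x86-64 Linux.

```c
#include <stdint.h>

/* The C function from Figure 3.11(a). */
long arith(long x, long y, long z)
{
    long t1 = x ^ y;
    long t2 = z * 48;
    long t3 = t1 & 0x0F0F0F0F;
    long t4 = t2 - t3;
    return t4;
}

/* Replays lines 2-6 of Figure 3.11(b). Variables are named after the
 * registers they model; uint64_t gives well-defined wraparound. */
long arith_by_steps(long x, long y, long z)
{
    uint64_t rdi = (uint64_t) x, rsi = (uint64_t) y, rdx = (uint64_t) z;
    uint64_t rax;

    rdi ^= rsi;                          /* xorq %rsi, %rdi         : t1 */
    rax = rdx + rdx * 2;                 /* leaq (%rdx,%rdx,2),%rax : 3*z */
    rax <<= 4;                           /* salq $4, %rax           : 48*z */
    rdi = (uint32_t) rdi & 252645135u;   /* andl: 32-bit result, upper half zeroed */
    rax -= rdi;                          /* subq %rdi, %rax         : t2 - t3 */
    return (long) rax;
}
```

Note how the variable rax successively holds 3*z, 48*z, and the return value,
which is the reuse of a single register for several program values.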






Practice Problem 3.10

In the following variant of the function of Figure 3.11(a), the expressions have
been replaced by blanks:

long arith2(long x, long y, long z)
{
    long t1 = ________;
    long t2 = ________;
    long t3 = ________;
    long t4 = ________;
    return t4;
}

The portion of the generated assembly code implementing these expressions
is as follows:

  long arith2(long x, long y, long z)
  x in %rdi, y in %rsi, z in %rdx
arith2:
  orq %rsi, %rdi
  sarq $3, %rdi
  notq %rdi
  movq %rdx, %rax
  subq %rdi, %rax
  ret

Based on this assembly code, fill in the missing portions of the C code.





Practice Problem 3.11

It is common to find assembly-code lines of the form

    xorq %rdx,%rdx

in code that was generated from C where no EXCLUSIVE-OR operations were
present.

A. Explain the effect of this particular EXCLUSIVE-OR instruction and what useful
   operation it implements.


B. What would be the more straightforward way to express this operation in 
assembly code? 


C. Compare the number of bytes to encode these two different implementa- 
tions of the same operation. 


3.5.5 Special Arithmetic Operations 


As we saw in Section 2.3, multiplying two 64-bit signed or unsigned integers can
yield a product that requires 128 bits to represent. The x86-64 instruction set
provides limited support for operations involving 128-bit (16-byte) numbers. Con-
tinuing with the naming convention of word (2 bytes), double word (4 bytes), and
quad word (8 bytes), Intel refers to a 16-byte quantity as an oct word. Figure 3.12
describes instructions that support generating the full 128-bit product of two 64-bit
numbers, as well as integer division.

Instruction   Effect                                    Description
imulq S       R[%rdx]:R[%rax] <- S x R[%rax]            Signed full multiply
mulq S        R[%rdx]:R[%rax] <- S x R[%rax]            Unsigned full multiply
cqto          R[%rdx]:R[%rax] <- SignExtend(R[%rax])    Convert to oct word
idivq S       R[%rdx] <- R[%rdx]:R[%rax] mod S;         Signed divide
              R[%rax] <- R[%rdx]:R[%rax] / S
divq S        R[%rdx] <- R[%rdx]:R[%rax] mod S;         Unsigned divide
              R[%rax] <- R[%rdx]:R[%rax] / S

Figure 3.12 Special arithmetic operations. These operations provide full 128-bit
multiplication and division, for both signed and unsigned numbers. The pair of registers
%rdx and %rax are viewed as forming a single 128-bit oct word.

The imulq instruction has two different forms. One form, shown in Figure 3.10,
is as a member of the IMUL instruction class. In this form, it serves as a "two-
operand" multiply instruction, generating a 64-bit product from two 64-bit oper-
ands. It implements the operations $*^{u}_{64}$ and $*^{t}_{64}$ described in Sections 2.3.4 and 2.3.5.
(Recall that when truncating the product to 64 bits, both unsigned multiply and
two's-complement multiply have the same bit-level behavior.)

Additionally, the x86-64 instruction set includes two different "one-operand"
multiply instructions to compute the full 128-bit product of two 64-bit values:
one for unsigned (mulq) and one for two's-complement (imulq) multiplication.
For both of these instructions, one argument must be in register %rax, and the
other is given as the instruction source operand. The product is then stored in
registers %rdx (high-order 64 bits) and %rax (low-order 64 bits). Although the
name imulq is used for two distinct multiplication operations, the assembler can
tell which one is intended by counting the number of operands.

As an example, the following C code demonstrates the generation of a 128-bit
product of two unsigned 64-bit numbers x and y:

#include <inttypes.h>

typedef unsigned __int128 uint128_t;

void store_uprod(uint128_t *dest, uint64_t x, uint64_t y) {
    *dest = x * (uint128_t) y;
}

In this program, we explicitly declare x and y to be 64-bit numbers, using defi-
nitions declared in the file inttypes.h, as part of an extension of the C standard.
Unfortunately, this standard does not make provisions for 128-bit values. Instead,
we rely on support provided by gcc for 128-bit integers, declared using the name
__int128. Our code uses a typedef declaration to define data type uint128_t,
following the naming pattern for other data types found in inttypes.h. The code
specifies that the resulting product should be stored at the 16 bytes designated by
pointer dest.

The assembly code generated by gcc for this function is as follows:































    void store_uprod(uint128_t *dest, uint64_t x, uint64_t y)
    dest in %rdi, x in %rsi, y in %rdx
1   store_uprod:
2     movq %rsi, %rax       Copy x to multiplicand
3     mulq %rdx             Multiply by y
4     movq %rax, (%rdi)     Store lower 8 bytes at dest
5     movq %rdx, 8(%rdi)    Store upper 8 bytes at dest+8
6     ret


Observe that storing the product requires two movq instructions: one for the
low-order 8 bytes (line 4), and one for the high-order 8 bytes (line 5). Since the
code is generated for a little-endian machine, the high-order bytes are stored at
higher addresses, as indicated by the address specification 8(%rdi).
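The two 8-byte halves that end up in %rax and %rdx can be observed from C by
splitting the 128-bit product. This is a sketch under the same assumption as the
text, namely a compiler such as gcc that supports the __int128 extension; the
function name umul_halves is hypothetical.

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;   /* compiler extension, as in the text */

/* Computes the full 128-bit unsigned product of x and y and returns its
 * halves: *lo models %rax and *hi models %rdx after a one-operand mulq. */
void umul_halves(uint64_t x, uint64_t y, uint64_t *lo, uint64_t *hi)
{
    uint128_t p = (uint128_t) x * y;
    *lo = (uint64_t) p;            /* low-order 64 bits */
    *hi = (uint64_t) (p >> 64);    /* high-order 64 bits */
}
```

For example, squaring 2^32 gives exactly 2^64, so the low half is 0 and the high
half is 1, a result that a 64-bit multiply alone would truncate to 0.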

Our earlier table of arithmetic operations (Figure 3.10) does not list any
division or modulus operations. These operations are provided by the single-
operand divide instructions similar to the single-operand multiply instructions.
The signed division instruction idivq takes as its dividend the 128-bit quantity
in registers %rdx (high-order 64 bits) and %rax (low-order 64 bits). The divisor is
given as the instruction operand. The instruction stores the quotient in register
%rax and the remainder in register %rdx.

For most applications of 64-bit division, the dividend is given as a 64-bit value.
This value should be stored in register %rax. The bits of %rdx should then be set to
either all zeros (unsigned arithmetic) or the sign bit of %rax (signed arithmetic).
The latter operation can be performed using the instruction cqto.² This instruction
takes no operands; it implicitly reads the sign bit from %rax and copies it across
all of %rdx.
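The effect of cqto can be sketched in C as follows. The function names are
assumptions for this illustration, and the second function relies on the
__int128 compiler extension to show that the register pair %rdx:%rax then
represents the same signed value as %rax alone.

```c
#include <stdint.h>

/* Hypothetical model of cqto: given the value in %rax, returns the value
 * placed in %rdx, which is all ones if %rax is negative and zero otherwise. */
uint64_t cqto_model(int64_t rax)
{
    return rax < 0 ? UINT64_MAX : 0;
}

/* Forms the 128-bit dividend %rdx:%rax produced by cqto and returns it
 * as a signed 128-bit value (conversion behavior assumed as in gcc). */
__int128 dividend_model(int64_t rax)
{
    unsigned __int128 bits =
        ((unsigned __int128) cqto_model(rax) << 64) | (uint64_t) rax;
    return (__int128) bits;
}
```

For unsigned division the corresponding step is simply to set the high half to
zero, which is why no dedicated instruction is needed in that case.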

As an illustration of the implementation of division with x86-64, the following
C function computes the quotient and remainder of two 64-bit, signed numbers:

void remdiv(long x, long y,
            long *qp, long *rp) {
    long q = x/y;
    long r = x%y;
    *qp = q;
    *rp = r;
}

2. This instruction is called cqo in the Intel documentation, one of the few cases where the ATT-format
name for an instruction does not match the Intel name.

This compiles to the following assembly code:


    void remdiv(long x, long y, long *qp, long *rp)
    x in %rdi, y in %rsi, qp in %rdx, rp in %rcx
1   remdiv:
2     movq %rdx, %r8        Copy qp
3     movq %rdi, %rax       Move x to lower 8 bytes of dividend
4     cqto                  Sign-extend to upper 8 bytes of dividend
5     idivq %rsi            Divide by y
6     movq %rax, (%r8)      Store quotient at qp
7     movq %rdx, (%rcx)     Store remainder at rp
8     ret


In this code, argument qp must first be saved in a different register (line 2),
since argument register %rdx is required for the division operation. Lines 3-4 then
prepare the dividend by copying and sign-extending x. Following the division, the
quotient in register %rax gets stored at qp (line 6), while the remainder in register
%rdx gets stored at rp (line 7).

Unsigned division makes use of the divq instruction. Typically, register %rdx
is set to zero beforehand.










Practice Problem 3.12

Consider the following function for computing the quotient and remainder of two
unsigned 64-bit numbers:

void uremdiv(unsigned long x, unsigned long y,
             unsigned long *qp, unsigned long *rp) {
    unsigned long q = x/y;
    unsigned long r = x%y;
    *qp = q;
    *rp = r;
}

Modify the assembly code shown for signed division to implement this function.


3.6 Control 


So far, we have only considered the behavior of straight-line code, where instruc- 
tions follow one another in sequence. Some constructs in C, such as conditionals, 
loops, and switches, require conditional execution, where the sequence of oper- 
ations that get performed depends on the outcomes of tests applied to the data. 
Machine code provides two basic low-level mechanisms for implementing condi- 
tional behavior: it tests data values and then alters either the control flow or the 
data flow based on the results of these tests. 

Data-dependent control flow is the more general and more common approach
for implementing conditional behavior, and so we will examine this first. Normally,
both statements in C and instructions in machine code are executed sequentially,
in the order they appear in the program. The execution order of a set of machine-
code instructions can be altered with a jump instruction, indicating that control
should pass to some other part of the program, possibly contingent on the result
of some test. The compiler must generate instruction sequences that build upon
this low-level mechanism to implement the control constructs of C.

In our presentation, we first cover the two ways of implementing conditional
operations. We then describe methods for presenting loops and switch state-
ments.


3.6.1 Condition Codes 


In addition to the integer registers, the CPU maintains a set of single-bit condition
code registers describing attributes of the most recent arithmetic or logical oper-
ation. These registers can then be tested to perform conditional branches. These
condition codes are the most useful:


CF: Carry flag. The most recent operation generated a carry out of the most 
significant bit. Used to detect overflow for unsigned operations. 

ZF: Zero flag. The most recent operation yielded zero. 

SF: Sign flag. The most recent operation yielded a negative value. 

OF: Overflow flag. The most recent operation caused a two's-complement
overflow, either negative or positive.


For example, suppose we used one of the ADD instructions to perform the
equivalent of the C assignment t = a+b, where variables a, b, and t are integers.
Then the condition codes would be set according to the following C expressions:

CF   (unsigned) t < (unsigned) a               Unsigned overflow
ZF   (t == 0)                                  Zero
SF   (t < 0)                                   Negative
OF   (a < 0 == b < 0) && (t < 0 != a < 0)      Signed overflow
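These four expressions can be collected into a small C function that reports
the flags an addition would produce. The struct flags_t and the function
add_flags are hypothetical names for this sketch; the sum is formed with
unsigned arithmetic because signed overflow is undefined in C, and the
conversion back to int64_t is assumed to wrap as it does with gcc.

```c
#include <stdint.h>

/* Hypothetical container for the four condition codes discussed here. */
typedef struct {
    int cf, zf, sf, of;
} flags_t;

/* Models the condition codes set by a 64-bit add computing t = a + b. */
flags_t add_flags(int64_t a, int64_t b)
{
    /* Wraparound sum, computed on unsigned values to stay well defined. */
    int64_t t = (int64_t) ((uint64_t) a + (uint64_t) b);
    flags_t f;
    f.cf = (uint64_t) t < (uint64_t) a;                    /* unsigned overflow */
    f.zf = (t == 0);                                       /* zero */
    f.sf = (t < 0);                                        /* negative */
    f.of = ((a < 0) == (b < 0)) && ((t < 0) != (a < 0));   /* signed overflow */
    return f;
}
```

Adding 1 to the largest positive value sets OF and SF but not CF, while adding
1 to -1 sets CF and ZF but not OF, which shows that the carry and overflow flags
track different notions of overflow.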


The leaq instruction does not alter any condition codes, since it is intended
to be used in address computations. Otherwise, all of the instructions listed in
Figure 3.10 cause the condition codes to be set. For the logical operations, such
as XOR, the carry and overflow flags are set to zero. For the shift operations, the
carry flag is set to the last bit shifted out, while the overflow flag is set to zero. For
reasons that we will not delve into, the INC and DEC instructions set the overflow
and zero flags, but they leave the carry flag unchanged.

In addition to the setting of condition codes by the instructions of Figure 3.10,
there are two instruction classes (having 8-, 16-, 32-, and 64-bit forms) that set
condition codes without altering any other registers; these are listed in Figure 3.13.
The CMP instructions set the condition codes according to the differences of their
two operands. They behave in the same way as the SUB instructions, except that
they set the condition codes without updating their destinations. With ATT format,
the operands are listed in reverse order, making the code difficult to read. These
instructions set the zero flag if the two operands are equal. The other flags can
be used to determine ordering relations between the two operands. The TEST
instructions behave in the same manner as the AND instructions, except that they
set the condition codes without altering their destinations. Typically, the same
operand is repeated (e.g., testq %rax,%rax to see whether %rax is negative, zero,
or positive), or one of the operands is a mask indicating which bits should be tested.

Instruction      Based on    Description
CMP   S1, S2     S2 - S1     Compare
  cmpb                       Compare byte
  cmpw                       Compare word
  cmpl                       Compare double word
  cmpq                       Compare quad word
TEST  S1, S2     S1 & S2     Test
  testb                      Test byte
  testw                      Test word
  testl                      Test double word
  testq                      Test quad word

Figure 3.13 Comparison and test instructions. These instructions set the condition
codes without updating any other registers.


3.6.2 Accessing the Condition Codes 


Rather than reading the condition codes directly, there are three common ways
of using the condition codes: (1) we can set a single byte to 0 or 1 depending
on some combination of the condition codes, (2) we can conditionally jump to
some other part of the program, or (3) we can conditionally transfer data. For the
first case, the instructions described in Figure 3.14 set a single byte to 0 or to 1
depending on some combination of the condition codes. We refer to this entire
class of instructions as the SET instructions; they differ from one another based on
which combinations of condition codes they consider, as indicated by the different
suffixes for the instruction names. It is important to recognize that the suffixes for
these instructions denote different conditions and not different operand sizes. For
example, instructions setl and setb denote "set less" and "set below," not "set
long word" or "set byte."

Instruction   Synonym   Effect                     Set condition
sete  D       setz      D <- ZF                    Equal / zero
setne D       setnz     D <- ~ZF                   Not equal / not zero
sets  D                 D <- SF                    Negative
setns D                 D <- ~SF                   Nonnegative
setg  D       setnle    D <- ~(SF ^ OF) & ~ZF      Greater (signed >)
setge D       setnl     D <- ~(SF ^ OF)            Greater or equal (signed >=)
setl  D       setnge    D <- SF ^ OF               Less (signed <)
setle D       setng     D <- (SF ^ OF) | ZF        Less or equal (signed <=)
seta  D       setnbe    D <- ~CF & ~ZF             Above (unsigned >)
setae D       setnb     D <- ~CF                   Above or equal (unsigned >=)
setb  D       setnae    D <- CF                    Below (unsigned <)
setbe D       setna     D <- CF | ZF               Below or equal (unsigned <=)

Figure 3.14 The SET instructions. Each instruction sets a single byte to 0 or 1 based on
some combination of the condition codes. Some instructions have "synonyms," that is,
alternate names for the same machine instruction.

A SET instruction has either one of the low-order single-byte register elements
(Figure 3.2) or a single-byte memory location as its destination, setting this byte to
either 0 or 1. To generate a 32-bit or 64-bit result, we must also clear the high-order
bits. A typical instruction sequence to compute the C expression a < b, where a
and b are both of type long, proceeds as follows:

    int comp(data_t a, data_t b)
    a in %rdi, b in %rsi
1   comp:
2     cmpq   %rsi, %rdi     Compare a:b
3     setl   %al            Set low-order byte of %eax to 0 or 1
4     movzbl %al, %eax      Clear rest of %eax (and rest of %rax)
5     ret


Note the comparison order of the cmpq instruction (line 2). Although the
arguments are listed in the order %rsi (b), then %rdi (a), the comparison is
really between a and b. Recall also, as discussed in Section 3.4.2, that the movzbl
instruction (line 4) clears not just the high-order 3 bytes of %eax, but the upper 4
bytes of the entire register, %rax, as well.

For some of the underlying machine instructions, there are multiple possible
names, which we list as "synonyms." For example, both setg (for "set greater")
and setnle (for "set not less or equal") refer to the same machine instruction.
Compilers and disassemblers make arbitrary choices of which names to use.

Although all arithmetic and logical operations set the condition codes, the de-
scriptions of the different SET instructions apply to the case where a comparison
instruction has been executed, setting the condition codes according to the com-
putation t = a-b. More specifically, let a, b, and t be the integers represented in
two's-complement form by variables a, b, and t, respectively, and so $t = a -^{t}_{w} b$,
where w depends on the sizes associated with a and b.

Consider the sete, or "set when equal," instruction. When a = b, we will
have t = 0, and hence the zero flag indicates equality. Similarly, consider testing
for signed comparison with the setl, or "set when less," instruction. When no
overflow occurs (indicated by having OF set to 0), we will have a < b when $a -^{t}_{w} b < 0$,
indicated by having SF set to 1, and $a \geq b$ when $a -^{t}_{w} b \geq 0$, indicated by having
SF set to 0. On the other hand, when overflow occurs, we will have a < b when
$a -^{t}_{w} b > 0$ (negative overflow) and a > b when $a -^{t}_{w} b < 0$ (positive overflow). We
cannot have overflow when a = b. Thus, when OF is set to 1, we will have a < b if
and only if SF is set to 0. Combining these cases, the EXCLUSIVE-OR of the overflow
and sign bits provides a test for whether a < b. The other signed comparison tests
are based on other combinations of SF ^ OF and ZF.
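The case analysis above can be checked with a short C sketch that derives SF
and OF from the wrapped difference and combines them with EXCLUSIVE-OR. The
function name setl_model is an assumption for this example; the subtraction is
done on unsigned values so that overflow wraps instead of being undefined, and
the conversion back to int64_t is assumed to wrap as it does with gcc.

```c
#include <stdint.h>

/* Hypothetical model of "cmpq b, a" followed by setl: returns SF ^ OF
 * for the computation t = a - b on 64-bit two's-complement values. */
int setl_model(int64_t a, int64_t b)
{
    /* Wrapped difference, i.e. the 64-bit two's-complement result. */
    int64_t t = (int64_t) ((uint64_t) a - (uint64_t) b);
    int sf = (t < 0);
    /* Subtraction overflows when a and b have different signs and the
     * sign of the result differs from the sign of a. */
    int of = ((a < 0) != (b < 0)) && ((t < 0) != (a < 0));
    return sf ^ of;
}
```

The interesting inputs are those that overflow: for a equal to the most negative
value and b equal to 1, the wrapped difference is positive (SF = 0) but OF = 1, so
the function still reports that a < b.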

For the testing of unsigned comparisons, we now let a and b be the integers 
represented in unsigned form by variables a and b. In performing the computation 
t = a-b, the carry flag will be set by the CMP instruction when a - b < 0, and so the
unsigned comparisons use combinations of the carry and zero flags. 
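The unsigned tests can be sketched the same way. The function names below are
illustrative assumptions; cmp_cf detects the borrow by observing that the
wrapped difference exceeds a exactly when a is smaller than b.

```c
#include <stdint.h>

/* Carry flag after comparing unsigned a with b (t = a - b): set when the
 * subtraction needs a borrow, in which case the wrapped result exceeds a. */
int cmp_cf(uint64_t a, uint64_t b)
{
    return (uint64_t) (a - b) > a;
}

/* Zero flag after the same comparison. */
int cmp_zf(uint64_t a, uint64_t b)
{
    return (uint64_t) (a - b) == 0;
}

/* Hypothetical models of three SET instructions from Figure 3.14. */
int setb_model(uint64_t a, uint64_t b)  { return cmp_cf(a, b); }                    /* CF */
int seta_model(uint64_t a, uint64_t b)  { return !cmp_cf(a, b) && !cmp_zf(a, b); }  /* ~CF & ~ZF */
int setbe_model(uint64_t a, uint64_t b) { return cmp_cf(a, b) || cmp_zf(a, b); }    /* CF | ZF */
```

Because only CF and ZF are consulted, the same bit patterns that compare as
"less" under the signed rules can compare as "above" here, which is why the
compiler must pick the instruction suffix from the C data type.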

It is important to note how machine code does or does not distinguish be-
tween signed and unsigned values. Unlike in C, it does not associate a data type
with each program value. Instead, it mostly uses the same instructions for the two
cases, because many arithmetic operations have the same bit-level behavior for
unsigned and two's-complement arithmetic. Some circumstances require different
instructions to handle signed and unsigned operations, such as using different
versions of right shifts, division and multiplication instructions, and different
combinations of condition codes.


Practice Problem 3.13

The C code

int comp(data_t a, data_t b) {
    return a COMP b;
}

shows a general comparison between arguments a and b, where data_t, the data
type of the arguments, is defined (via typedef) to be one of the integer data types
listed in Figure 3.1 and either signed or unsigned. The comparison COMP is defined
via #define.

Suppose a is in some portion of %rdi while b is in some portion of %rsi. For
each of the following instruction sequences, determine which data types data_t
and which comparisons COMP could cause the compiler to generate this code.
(There can be multiple correct answers; you should list them all.)

A.  cmpl  %esi, %edi
    setl  %al

B.  cmpw  %si, %di
    setge %al

C.  cmpb  %sil, %dil
    setbe %al

D.  cmpq  %rsi, %rdi
    setne %al

Practice Problem 3.14

The C code

int test(data_t a) {
    return a TEST 0;
}

shows a general comparison between argument a and 0, where we can set the
data type of the argument by declaring data_t with a typedef, and the nature
of the comparison by declaring TEST with a #define declaration. The following
instruction sequences implement the comparison, where a is held in some portion
of register %rdi. For each sequence, determine which data types data_t and which
comparisons TEST could cause the compiler to generate this code. (There can be
multiple correct answers; list all correct ones.)

A.  testq %rdi, %rdi
    setge %al

B.  testw %di, %di
    sete  %al

C.  testb %dil, %dil
    seta  %al

D.  testl %edi, %edi
    setle %al

3.6.3 Jump Instructions

Under normal execution, instructions follow each other in the order they are
listed. A jump instruction can cause the execution to switch to a completely
new position in the program. These jump destinations are generally indicated in
assembly code by a label. Consider the following (very contrived) assembly-code
sequence:

  movq $0,%rax          Set %rax to 0
  jmp .L1               Goto .L1
  movq (%rax),%rdx      Null pointer dereference (skipped)
.L1:
  popq %rdx             Jump target


Instruction      Synonym   Jump condition      Description
jmp  Label                 1                   Direct jump
jmp  *Operand              1                   Indirect jump
je   Label       jz        ZF                  Equal / zero
jne  Label       jnz       ~ZF                 Not equal / not zero
js   Label                 SF                  Negative
jns  Label                 ~SF                 Nonnegative
jg   Label       jnle      ~(SF ^ OF) & ~ZF    Greater (signed >)
jge  Label       jnl       ~(SF ^ OF)          Greater or equal (signed >=)
jl   Label       jnge      SF ^ OF             Less (signed <)
jle  Label       jng       (SF ^ OF) | ZF      Less or equal (signed <=)
ja   Label       jnbe      ~CF & ~ZF           Above (unsigned >)
jae  Label       jnb       ~CF                 Above or equal (unsigned >=)
jb   Label       jnae      CF                  Below (unsigned <)
jbe  Label       jna       CF | ZF             Below or equal (unsigned <=)

Figure 3.15 The jump instructions. These instructions jump to a labeled destination
when the jump condition holds. Some instructions have "synonyms," alternate names
for the same machine instruction.


The instruction jmp .L1 will cause the program to skip over the movq instruc-
tion and instead resume execution with the popq instruction. In generating the
object-code file, the assembler determines the addresses of all labeled instruc-
tions and encodes the jump targets (the addresses of the destination instructions)
as part of the jump instructions.

Figure 3.15 shows the different jump instructions. The jmp instruction jumps
unconditionally. It can be either a direct jump, where the jump target is encoded
as part of the instruction, or an indirect jump, where the jump target is read from
a register or a memory location. Direct jumps are written in assembly code by
giving a label as the jump target, for example, the label .L1 in the code shown.
Indirect jumps are written using '*' followed by an operand specifier using one of
the memory operand formats described in Figure 3.3. As examples, the instruction

  jmp *%rax

uses the value in register %rax as the jump target, and the instruction

  jmp *(%rax)

reads the jump target from memory, using the value in %rax as the read address.

The remaining jump instructions in the table are conditional: they either
jump or continue executing at the next instruction in the code sequence, depending
on some combination of the condition codes. The names of these instructions
and the conditions under which they jump match those of the SET instructions
(see Figure 3.14). As with the SET instructions, some of the underlying machine
instructions have multiple names. Conditional jumps can only be direct.


3.6.4 Jump Instruction Encodings 


For the most part, we will not concern ourselves with the detailed format of ma- 
chine code. On the other hand, understanding how the targets of jump instructions 
are encoded will become important when we study linking in Chapter 7. In ad- 
dition, it helps when interpreting the output of a disassembler. In assembly code, 
jump targets are written using symbolic labels. The assembler, and later the linker, 
generate the proper encodings of the jump targets. There are several different en- 
codings for jumps, but some of the most commonly used ones are PC relative. That 
is, they encode the difference between the address of the target instruction and 
the address of the instruction immediately following the jump. These offsets can 
be encoded using 1, 2, or 4 bytes. A second encoding method is to give an "abso-
lute" address, using 4 bytes to directly specify the target. The assembler and linker
select the appropriate encodings of the jump destinations.

As an example of PC-relative addressing, the following assembly code for a
function was generated by compiling a file branch.c. It contains two jumps: the
jmp instruction on line 2 jumps forward to a higher address, while the jg instruction
on line 7 jumps back to a lower one.


1     movq %rdi, %rax
2     jmp .L2
3   .L3:
4     sarq %rax
5   .L2:
6     testq %rax, %rax
7     jg .L3
8     rep; ret


The disassembled version of the .o format generated by the assembler is as
follows:

1    0: 48 89 f8     mov    %rdi,%rax
2    3: eb 03        jmp    8 <loop+0x8>
3    5: 48 d1 f8     sar    %rax
4    8: 48 85 c0     test   %rax,%rax
5    b: 7f f8        jg     5 <loop+0x5>
6    d: f3 c3        repz retq


In the annotations on the right generated by the disassembler, the jump targets
are indicated as 0x8 for the jump instruction on line 2 and 0x5 for the jump
instruction on line 5 (the disassembler lists all numbers in hexadecimal). Looking
at the byte encodings of the instructions, however, we see that the target of the first
jump instruction is encoded (in the second byte) as 0x03. Adding this to 0x5, the
address of the following instruction, we get jump target address 0x8, the address
of the instruction on line 4.

Similarly, the target of the second jump instruction is encoded as 0xf8 (decimal
-8) using a single-byte two's-complement representation. Adding this to 0xd
(decimal 13), the address of the instruction on line 6, we get 0x5, the address of
the instruction on line 3.

As these examples illustrate, the value of the program counter when performing
PC-relative addressing is the address of the instruction following the jump, not
that of the jump itself. This convention dates back to early implementations, when
the processor would update the program counter as its first step in executing an
instruction.

Aside

Line 8 of the assembly code shown on page 207 contains the instruction combination
rep; ret. These are rendered in the disassembled code (line 6) as repz retq. One
can infer that repz is a synonym for rep, just as retq is a synonym for ret. Looking
at the Intel and AMD documentation for the rep instruction, we find that it is
normally used to implement a repeating string operation [3, 51]. It seems completely
inappropriate here. The answer to this puzzle can be seen in AMD's guidelines to
compiler writers [1]. They recommend using the combination of rep followed by ret
to avoid making the ret instruction the destination of a conditional jump
instruction. Without the rep instruction, the jg instruction (line 7 of the assembly
code) would proceed to the ret instruction when the branch is not taken. According
to AMD, their processors cannot properly predict the destination of a ret
instruction when it is reached from a jump instruction. The rep instruction serves
as a form of no-operation here, and so inserting it as the jump destination does not
change the behavior of the code, except to make it faster on AMD processors.
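The arithmetic in these two examples can be checked mechanically. The following is a minimal sketch in C, not part of any toolchain: the helper name jump_target_rel8 is hypothetical, and it assumes the caller supplies the address of the jump instruction, its length in bytes, and the single displacement byte taken from the encoding.

```c
#include <stdint.h>

/* Hypothetical helper: compute the target of a jump that uses a
   1-byte PC-relative displacement.
     insn_addr  address of the jump instruction itself
     insn_len   total length of the jump instruction in bytes
     disp       displacement byte as it appears in the encoding */
uint64_t jump_target_rel8(uint64_t insn_addr, unsigned insn_len, uint8_t disp)
{
    /* The program counter holds the address of the instruction
       that follows the jump, not that of the jump itself. */
    uint64_t next = insn_addr + insn_len;

    /* Interpret the byte as a two's-complement number. */
    int64_t offset = (disp & 0x80) ? (int64_t) disp - 256 : (int64_t) disp;

    /* Unsigned addition wraps, so a negative offset moves backward. */
    return next + (uint64_t) offset;
}
```

For the jmp instruction at address 0x3 (bytes eb 03), jump_target_rel8(0x3, 2, 0x03) gives 0x8. For the jg instruction at address 0xb (bytes 7f f8), jump_target_rel8(0xb, 2, 0xf8) gives 0x5.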

The following shows the disassembled version of the program after linking:

1    4004d0: 48 89 f8    mov    %rdi,%rax
2    4004d3: eb 03       jmp    4004d8 <loop+0x8>
3    4004d5: 48 d1 f8    sar    %rax
4    4004d8: 48 85 c0    test   %rax,%rax
5    4004db: 7f f8       jg     4004d5 <loop+0x5>
6    4004dd: f3 c3       repz retq

The instructions have been relocated to different addresses, but the encodings
of the jump targets in lines 2 and 5 remain unchanged. By using a PC-relative
encoding of the jump targets, the instructions can be compactly encoded (requiring
just 2 bytes), and the object code can be shifted to different positions in memory
without alteration.
In the following excerpts from a disassembled binary, some of the information has
been replaced by X's. Answer the following questions about these instructions.


A. What is the target of the je instruction below? (You do not need to know 
anything about the callq instruction here.) 


    4003fa: 74 02    je     XXXXXX
    4003fc: ff d0    callq  *%rax


B. What is the target of the je instruction below? 


    40042f: 74 f4    je     XXXXXX
    400431: 5d       pop    %rbp


C. What are the addresses of the ja and pop instructions?


    XXXXXX: 77 02    ja     400547
    XXXXXX: 5d       pop    %rbp


D. In the code that follows, the jump target is encoded in PC-relative form as a
4-byte two's-complement number. The bytes are listed from least significant to
most, reflecting the little-endian byte ordering of x86-64. What is the address
of the jump target?


    4005e8: e9 73 ff ff ff    jmpq   XXXXXXX
    4005ed: 90                nop


The jump instructions provide a means to implement conditional execution
(if), as well as several different loop constructs.


3.6.5 Implementing Conditional Branches with Conditional Control 


The most general way to translate conditional expressions and statements from
C into machine code is to use combinations of conditional and unconditional
jumps. (As an alternative, we will see in Section 3.6.6 that some conditionals
can be implemented by conditional transfers of data rather than control.) For
example, Figure 3.16(a) shows the C code for a function that computes the absolute
value of the difference of two numbers.³ The function also has a side effect of
incrementing one of two counters, encoded as global variables lt_cnt and ge_cnt.
GCC generates the assembly code shown as Figure 3.16(c). Our rendition of
the machine code into C is shown as the function gotodiff_se (Figure 3.16(b)).
It uses the goto statement in C, which is similar to the unconditional jump of
assembly code. Using goto statements is generally considered a bad programming
style, since their use can make code very difficult to read and debug. We use them
in our presentation as a way to construct C programs that describe the control
flow of machine code. We call this style of programming "goto code."

3. Actually, it can return a negative value if one of the subtractions overflows. Our
interest here is to demonstrate machine code, not to implement robust code.


(a) Original C code

    long lt_cnt = 0;
    long ge_cnt = 0;

    long absdiff_se(long x, long y)
    {
        long result;
        if (x < y) {
            lt_cnt++;
            result = y - x;
        }
        else {
            ge_cnt++;
            result = x - y;
        }
        return result;
    }

(b) Equivalent goto version

1   long gotodiff_se(long x, long y)
2   {
3       long result;
4       if (x >= y)
5           goto x_ge_y;
6       lt_cnt++;
7       result = y - x;
8       return result;
9    x_ge_y:
10      ge_cnt++;
11      result = x - y;
12      return result;
13  }

(c) Generated assembly code

    long absdiff_se(long x, long y)
    x in %rdi, y in %rsi
1   absdiff_se:
2     cmpq    %rsi, %rdi            Compare x:y
3     jge     .L2                   If >= goto x_ge_y
4     addq    $1, lt_cnt(%rip)      lt_cnt++
5     movq    %rsi, %rax
6     subq    %rdi, %rax            result = y - x
7     ret                           Return
8   .L2:                          x_ge_y:
9     addq    $1, ge_cnt(%rip)      ge_cnt++
10    movq    %rdi, %rax
11    subq    %rsi, %rax            result = x - y
12    ret                           Return

Figure 3.16 Compilation of conditional statements. (a) C procedure absdiff_se
contains an if-else statement. The generated assembly code is shown (c), along with
(b) a C procedure gotodiff_se that mimics the control flow of the assembly code.


In the goto code (Figure 3.16(b)), the statement goto x_ge_y on line 5 causes
a jump to the label x_ge_y (since it occurs when x ≥ y) on line 9. Continuing the
execution from this point, it completes the computations specified by the else
portion of function absdiff_se and returns. On the other hand, if the test x >= y
fails, the procedure will carry out the steps specified by the if portion of
absdiff_se and return.

Aside

Figure 3.16 shows an example of how we will demonstrate the translation of C language
control constructs into machine code. The figure contains an example C function (a)
and an annotated version of the assembly code generated by GCC (c). It also contains
a version in C that closely matches the structure of the assembly code (b). Although
these versions were generated in the sequence (a), (c), and (b), we recommend that
you read them in the order (a), (b), and then (c). That is, the C rendition of the
machine code will help you understand the key points, and this can guide you in
understanding the actual assembly code.

The assembly-code implementation (Figure 3.16(c)) first compares the two
operands (line 2), setting the condition codes. If the comparison result indicates
that x is greater than or equal to y, it then jumps to a block of code starting at
line 8 that increments global variable ge_cnt, computes x-y as the return value,
and returns. Otherwise, it continues with the execution of code beginning at line
4 that increments global variable lt_cnt, computes y-x as the return value, and
returns. We can see, then, that the control flow of the assembly code generated for
absdiff_se closely follows the goto code of gotodiff_se.
The general form of an if-else statement in C is given by the template


    if (test-expr)
        then-statement
    else
        else-statement


where test-expr is an integer expression that evaluates either to zero (interpreted
as meaning "false") or to a nonzero value (interpreted as meaning "true"). Only
one of the two branch statements (then-statement or else-statement) is executed.

For this general form, the assembly implementation typically adheres to the
following form, where we use C syntax to describe the control flow:


        t = test-expr;
        if (!t)
            goto false;
        then-statement
        goto done;
    false:
        else-statement
    done:


That is, the compiler generates separate blocks of code for then-statement and
else-statement. It inserts conditional and unconditional branches to make sure the
correct block is executed.
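To see the template applied to something other than absdiff_se, here is a minimal sketch in C. The function clamp_low and its goto counterpart are hypothetical examples, not taken from compiler output. The label is spelled false_branch rather than false because false is a keyword in recent versions of C.

```c
/* Hypothetical example: return x, but no smaller than lo. */
long clamp_low(long x, long lo)
{
    long result;
    if (x < lo)
        result = lo;
    else
        result = x;
    return result;
}

/* The same function written to follow the general template. */
long clamp_low_goto(long x, long lo)
{
    long result;
    long t = x < lo;      /* t = test-expr */
    if (!t)
        goto false_branch;
    result = lo;          /* then-statement */
    goto done;
 false_branch:
    result = x;           /* else-statement */
 done:
    return result;
}
```

Both versions return the same value for every pair of arguments; only the way control flow is expressed differs.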





When given the C code


    void cond(long a, long *p)
    {
        if (p && a > *p)
            *p = a;
    }


GCC generates the following assembly code:


      void cond(long a, long *p)
      a in %rdi, p in %rsi
    cond:
      testq   %rsi, %rsi
      je      .L1
      cmpq    %rdi, (%rsi)
      jge     .L1
      movq    %rdi, (%rsi)
    .L1:
      rep; ret


A. Write a goto version in C that performs the same computation and mimics
the control flow of the assembly code, in the style shown in Figure 3.16(b).
You might find it helpful to first annotate the assembly code as we have done
in our examples.

B. Explain why the assembly code contains two conditional branches, even
though the C code has only one if statement.





An alternate rule for translating if statements into goto code is as follows:


        t = test-expr;
        if (t)
            goto true;
        else-statement
        goto done;
    true:
        then-statement
    done:


A. Rewrite the goto version of absdiff_se based on this alternate rule.


B. Can you think of any reasons for choosing one rule over the other?





Practice Problem 3.18

Starting with C code of the form


    long test(long x, long y, long z) {
        long val = ________;
        if (________) {
            if (________)
                val = ________;
            else
                val = ________;
        } else if (________)
            val = ________;
        return val;
    }


GCC generates the following assembly code:


      long test(long x, long y, long z)
      x in %rdi, y in %rsi, z in %rdx
    test:
      leaq    (%rdi,%rsi), %rax
      addq    %rdx, %rax
      cmpq    $-3, %rdi
      jge     .L2
      cmpq    %rdx, %rsi
      jge     .L3
      movq    %rdi, %rax
      imulq   %rsi, %rax
      ret
    .L3:
      movq    %rsi, %rax
      imulq   %rdx, %rax
      ret
    .L2:
      cmpq    $2, %rdi
      jle     .L4
      movq    %rdi, %rax
      imulq   %rdx, %rax
    .L4:
      rep; ret


Fill in the missing expressions in the C code.
3.6.6 Implementing Conditional Branches with Conditional Moves


The conventional way to implement conditional operations is through a
conditional transfer of control, where the program follows one execution path when
a condition holds and another when it does not. This mechanism is simple and
general, but it can be very inefficient on modern processors.

An alternate strategy is through a conditional transfer of data. This approach 
computes both outcomes of a conditional operation and then selects one based on 
whether or not the condition holds. This strategy makes sense only in restricted 
cases, but it can then be implemented by a simple conditional move instruction 
that is better matched to the performance characteristics of modern processors. 
Here, we examine this strategy and its implementation with x86-64. 

Figure 3.17(a) shows an example of code that can be compiled using a
conditional move. The function computes the absolute value of the difference of its
arguments x and y, as did our earlier example (Figure 3.16). Whereas the earlier
example had side effects in the branches, modifying the value of either lt_cnt or
ge_cnt, this version simply computes the value to be returned by the function.


(a) Original C code

    long absdiff(long x, long y)
    {
        long result;
        if (x < y)
            result = y - x;
        else
            result = x - y;
        return result;
    }

(b) Implementation using conditional assignment

1   long cmovdiff(long x, long y)
2   {
3       long rval = y-x;
4       long eval = x-y;
5       long ntest = x >= y;
6       /* Line below requires
7          single instruction: */
8       if (ntest) rval = eval;
9       return rval;
10  }

(c) Generated assembly code

    long absdiff(long x, long y)
    x in %rdi, y in %rsi
1   absdiff:
2     movq    %rsi, %rax
3     subq    %rdi, %rax      rval = y-x
4     movq    %rdi, %rdx
5     subq    %rsi, %rdx      eval = x-y
6     cmpq    %rsi, %rdi      Compare x:y
7     cmovge  %rdx, %rax      If >=, rval = eval
8     ret                     Return rval

Figure 3.17 Compilation of conditional statements using conditional assignment.
(a) C function absdiff contains a conditional expression. The generated assembly
code is shown (c), along with (b) a C function cmovdiff that mimics the operation
of the assembly code.

For this function, GCC generates the assembly code shown in Figure 3.17(c),
having an approximate form shown by the C function cmovdiff shown in Figure
3.17(b). Studying the C version, we can see that it computes both y-x and x-y,
naming these rval and eval, respectively. It then tests whether x is greater than 
or equal to y, and if so, copies eval to rval before returning rval. The assembly 
code in Figure 3.17(c) follows the same logic. The key is that the single cmovge 
instruction (line 7) of the assembly code implements the conditional assignment 
(line 8) of cmovdiff. It will transfer the data from the source register to the 
destination, only if the cmpq instruction of line 6 indicates that one value is greater 
than or equal to the other (as indicated by the suffix ge). 

To understand why code based on conditional data transfers can outperform
code based on conditional control transfers (as in Figure 3.16), we must understand 
something about how modern processors operate. As we will see in Chapters 4 
and 5, processors achieve high performance through pipelining, where an instruc- 
tion is processed via a sequence of stages, each performing one small portion of 
the required operations (e.g., fetching the instruction from memory, determining 
the instruction type, reading from memory, performing an arithmetic operation, 
writing to memory, and updating the program counter). This approach achieves 
high performance by overlapping the steps of the successive instructions, such 
as fetching one instruction while performing the arithmetic operations for a pre- 
vious instruction. To do this requires being able to determine the sequence of 
instructions to be executed well ahead of time in order to keep the pipeline full of
instructions to be executed. When the machine encounters a conditional jump (re- 
ferred to as a "branch"), it cannot determine which way the branch will go until it 
has evaluated the branch condition. Processors employ sophisticated branch pre- 
diction logic to try to guess whether or not each jump instruction will be followed. 
As long as it can guess reliably (modern microprocessor designs try to achieve 
success rates on the order of 90%), the instruction pipeline will be kept full of 
instructions. Mispredicting a jump, on the other hand, requires that the processor 
discard much of the work it has already done on future instructions and then begin 
filling the pipeline with instructions starting at the correct location. As we will see, 
such a misprediction can incur a serious penalty, say, 15-30 clock cycles of wasted 
effort, causing a serious degradation of program performance. 

As an example, we ran timings of the absdiff function on an Intel Haswell
processor using both methods of implementing the conditional operation. In a
typical application, the outcome of the test x < y is highly unpredictable, and
so even the most sophisticated branch prediction hardware will guess correctly
only around 50% of the time. In addition, the computations performed in each
of the two code sequences require only a single clock cycle. As a consequence,
the branch misprediction penalty dominates the performance of this function. For
x86-64 code with conditional jumps, we found that the function requires around 8
clock cycles per call when the branching pattern is easily predictable, and around
17.50 clock cycles per call when the branching pattern is random. Since a random
pattern is mispredicted about half the time, the average of 17.50 cycles is 8 cycles
plus half the penalty, and so we can infer that the branch misprediction penalty is
around 19 clock cycles. That means the time required by the function ranges between
around 8 and 27 cycles, depending on whether or not the branch is predicted correctly.
On the other hand, the code compiled using conditional moves requires
around 8 clock cycles regardless of the data being tested. The flow of control
does not depend on data, and this makes it easier for the processor to keep its
pipeline full.
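The inference of the misprediction penalty from the two timings can be written out as a small calculation. The sketch below is a model, not a measurement: it assumes the average time per call is T_avg = T_ok + p * T_mp, where T_ok is the time with correct prediction, T_mp is the penalty, and p is the misprediction probability (taken to be 0.5 for a random pattern). The function names are hypothetical.

```c
/* Hypothetical helper: solve T_avg = T_ok + p * T_mp for T_mp.
     t_ok   cycles per call when the branch is predicted correctly
     t_avg  average cycles per call for the measured pattern
     p      assumed probability of misprediction (0 < p <= 1) */
double mispredict_penalty(double t_ok, double t_avg, double p)
{
    return (t_avg - t_ok) / p;
}

/* Cycles for a single call in which the branch is mispredicted. */
double mispredicted_time(double t_ok, double t_mp)
{
    return t_ok + t_mp;
}
```

With the timings quoted above, mispredict_penalty(8.0, 17.5, 0.5) evaluates to 19.0, and mispredicted_time(8.0, 19.0) evaluates to 27.0, the upper end of the range.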





Running on an older processor model, our code required around 16 cycles when
the branching pattern was highly predictable, and around 31 cycles when the
pattern was random.


A. What is the approximate miss penalty?


B. How many cycles would the function require when the branch is
mispredicted?


Figure 3.18 illustrates some of the conditional move instructions available with 
x86-64. Each of these instructions has two operands: a source register or memory 
location S, and a destination register R. As with the different set (Section 3.6.2)
and jump (Section 3.6.3) instructions, the outcome of these instructions depends 
on the values of the condition codes. The source value is read from either memory 
or the source register, but it is copied to the destination only if the specified 
condition holds. 

The source and destination values can be 16, 32, or 64 bits long. Single-byte
conditional moves are not supported. Unlike the unconditional instructions,
where the operand length is explicitly encoded in the instruction name (e.g., movw
and movl), the assembler can infer the operand length of a conditional move
instruction from the name of the destination register, and so the same instruction
name can be used for all operand lengths.

Unlike conditional jumps, the processor can execute conditional move in- 
structions without having to predict the outcome of the test. The processor simply 
reads the source value (possibly from memory), checks the condition code, and 
then either updates the destination register or keeps it the same. We will explore 
the implementation of conditional moves in Chapter 4. 
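The way a move condition is derived from the condition codes can be modeled in ordinary C. The following is a sketch under stated assumptions, not a description of the hardware: the type cc_t and the functions model_cmpq, cond_ge, cond_l, cond_a, and model_cmovge are hypothetical names, and the model covers only 64-bit operands. The three conditions shown are taken from the move conditions listed in Figure 3.18.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition codes as set by "cmpq b, a", which computes a - b. */
typedef struct {
    bool zf;  /* zero flag */
    bool sf;  /* sign flag */
    bool of;  /* signed overflow flag */
    bool cf;  /* carry flag (unsigned borrow) */
} cc_t;

cc_t model_cmpq(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t) a, ub = (uint64_t) b;
    uint64_t diff = ua - ub;            /* wraps modulo 2^64 */
    bool a_neg = a < 0, b_neg = b < 0;
    bool d_neg = (diff >> 63) != 0;
    cc_t cc;
    cc.zf = (diff == 0);
    cc.sf = d_neg;
    cc.cf = (ua < ub);
    /* A subtraction overflows when the operands have different signs
       and the sign of the result differs from the sign of a. */
    cc.of = (a_neg != b_neg) && (d_neg != a_neg);
    return cc;
}

/* Move conditions, written as in Figure 3.18. */
bool cond_ge(cc_t cc) { return !(cc.sf ^ cc.of); }    /* signed >=  */
bool cond_l(cc_t cc)  { return cc.sf ^ cc.of; }       /* signed <   */
bool cond_a(cc_t cc)  { return !cc.cf && !cc.zf; }    /* unsigned > */

/* Model of "cmpq b, a" followed by "cmovge src, dst":
   returns the new value of the destination register. */
int64_t model_cmovge(int64_t a, int64_t b, int64_t src, int64_t dst)
{
    return cond_ge(model_cmpq(a, b)) ? src : dst;
}
```

Note that the destination keeps its old value when the condition fails, which is why model_cmovge takes the old destination value as an argument.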

Instruction     Synonym    Move condition      Description
cmove   S, R    cmovz      ZF                  Equal / zero
cmovne  S, R    cmovnz     ~ZF                 Not equal / not zero
cmovs   S, R               SF                  Negative
cmovns  S, R               ~SF                 Nonnegative
cmovg   S, R    cmovnle    ~(SF ^ OF) & ~ZF    Greater (signed >)
cmovge  S, R    cmovnl     ~(SF ^ OF)          Greater or equal (signed >=)
cmovl   S, R    cmovnge    SF ^ OF             Less (signed <)
cmovle  S, R    cmovng     (SF ^ OF) | ZF      Less or equal (signed <=)
cmova   S, R    cmovnbe    ~CF & ~ZF           Above (unsigned >)
cmovae  S, R    cmovnb     ~CF                 Above or equal (unsigned >=)
cmovb   S, R    cmovnae    CF                  Below (unsigned <)
cmovbe  S, R    cmovna     CF | ZF             Below or equal (unsigned <=)

Figure 3.18 The conditional move instructions. These instructions copy the source
value S to its destination R when the move condition holds. Some instructions have
"synonyms," alternate names for the same machine instruction.


To understand how conditional operations can be implemented via conditional
data transfers, consider the following general form of conditional expression
and assignment:


    v = test-expr ? then-expr : else-expr;


The standard way to compile this expression using conditional control transfer
would have the following form:


        if (!test-expr)
            goto false;
        v = then-expr;
        goto done;
    false:
        v = else-expr;
    done:


This code contains two code sequences: one evaluating then-expr and one
evaluating else-expr. A combination of conditional and unconditional jumps is used
to ensure that just one of the sequences is evaluated.

For the code based on a conditional move, both the then-expr and the else-expr
are evaluated, with the final value chosen based on the evaluation test-expr.
This can be described by the following abstract code:


    v = then-expr;
    ve = else-expr;
    t = test-expr;
    if (!t) v = ve;


The final statement in this sequence is implemented with a conditional move:
value ve is copied to v only if test condition t does not hold.
Not all conditional expressions can be compiled using conditional moves.
Most significantly, the abstract code we have shown evaluates both then-expr and
else-expr regardless of the test outcome. If one of those two expressions could
possibly generate an error condition or a side effect, this could lead to invalid
behavior. Such is the case for our earlier example (Figure 3.16). Indeed, we put the
side effects into this example specifically to force GCC to implement this function
using conditional transfers.
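The problem with side effects can be made concrete in C. The sketch below pairs the function absdiff_se of Figure 3.16(a) with a deliberately invalid variant, absdiff_se_eager, a hypothetical name for what would happen if both arms were evaluated before selecting a result, as a conditional-move translation requires. It is an illustration only, not something a compiler would generate.

```c
long lt_cnt = 0;
long ge_cnt = 0;

/* Branching version, as in Figure 3.16(a): exactly one counter
   is incremented on each call. */
long absdiff_se(long x, long y)
{
    long result;
    if (x < y) {
        lt_cnt++;
        result = y - x;
    } else {
        ge_cnt++;
        result = x - y;
    }
    return result;
}

/* INVALID hypothetical translation: both arms are evaluated eagerly
   and one result is selected afterward. The return value is still
   right, but both counters change on every call. */
long absdiff_se_eager(long x, long y)
{
    lt_cnt++;
    long rval = y - x;     /* then-arm, always executed */
    ge_cnt++;
    long eval = x - y;     /* else-arm, always executed */
    if (x >= y)
        rval = eval;
    return rval;
}
```

After one call absdiff_se(1, 5), lt_cnt is 1 and ge_cnt is 0. After one call absdiff_se_eager(1, 5) from the same starting state, both counters are 1, so the two functions are not equivalent even though they return the same value.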

As a second illustration, consider the following C function:


    long cread(long *xp) {
        return (xp ? *xp : 0);
    }


At first, this seems like a good candidate to compile using a conditional move to
set the result to zero when the pointer is null, as shown in the following assembly
code:


    long cread(long *xp)
    Invalid implementation of function cread
    xp in register %rdi
1   cread:
2     movq    (%rdi), %rax     v = *xp
3     testq   %rdi, %rdi       Test x
4     movl    $0, %edx         Set ve = 0
5     cmove   %rdx, %rax       If x==0, v = ve
6     ret                      Return v


This implementation is invalid, however, since the dereferencing of xp by the
movq instruction (line 2) occurs even when the test fails, causing a null pointer
dereferencing error. Instead, this code must be compiled using branching code.

Using conditional moves also does not always improve code efficiency. For
example, if either the then-expr or the else-expr evaluation requires a significant
computation, then this effort is wasted when the corresponding condition does
not hold. Compilers must take into account the relative performance of wasted
computation versus the potential for performance penalty due to branch
misprediction. In truth, they do not really have enough information to make this
decision reliably; for example, they do not know how well the branches will follow
predictable patterns. Our experiments with GCC indicate that it only uses conditional
moves when the two expressions can be computed very easily, for example, with
single add instructions. In our experience, GCC uses conditional control transfers
even in many cases where the cost of branch misprediction would exceed even
more complex computations.


Overall, then, we see that conditional data transfers offer an alternative 
strategy to conditional control transfers for implementing conditional operations. 
They can only be used in restricted cases, but these cases are fairly common and 
provide a much better match to the operation of modern processors. 





In the following C function, we have left the definition of operation OP incomplete:


    #define OP ________ /* Unknown operator */

    long arith(long x) {
        return x OP 8;
    }


When compiled, GCC generates the following assembly code:


      long arith(long x)
      x in %rdi
    arith:
      leaq    7(%rdi), %rax
      testq   %rdi, %rdi
      cmovns  %rdi, %rax
      sarq    $3, %rax
      ret


A. What operation is OP?


B. Annotate the code to explain how it works.



Starting with C code of the form


    long test(long x, long y) {
        long val = ________;
        if (________) {
            if (________)
                val = ________;
            else
                val = ________;
        } else if (________)
            val = ________;
        return val;
    }


GCC generates the following assembly code:


      long test(long x, long y)
      x in %rdi, y in %rsi
    test:
      leaq    0(,%rdi,8), %rax
      testq   %rsi, %rsi
      jle     .L2
      movq    %rsi, %rax
      subq    %rdi, %rax
      movq    %rdi, %rdx
      andq    %rsi, %rdx
      cmpq    %rsi, %rdi
      cmovge  %rdx, %rax
      ret
    .L2:
      addq    %rsi, %rdi
      cmpq    $-2, %rsi
      cmovle  %rdi, %rax
      ret


Fill in the missing expressions in the C code.


3.6.7 Loops 


C provides several looping constructs—namely, do-while, while, and for. No 
corresponding instructions exist in machine code. Instead, combinations of con- 
ditional tests and jumps are used to implement the effect of loops. Gcc and other 
compilers generate loop code based on the two basic loop patterns. We will study 
the translation of loops as a progression, starting with do-while and then working 
toward ones with more complex implementations, covering both patterns. 


Do-While Loops 


The general form of a do-while statement is as follows:


    do
        body-statement
        while (test-expr);


The effect of the loop is to repeatedly execute body-statement, evaluate test-expr,
and continue the loop if the evaluation result is nonzero. Observe that
body-statement is executed at least once.

This general form can be translated into conditionals and goto statements as
follows:


    loop:
        body-statement
        t = test-expr;
        if (t)
            goto loop;


That is, on each iteration the program evaluates the body statement and then the
test expression. If the test succeeds, the program goes back for another iteration.
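As a small illustration of this template, the following sketch shows a do-while loop and a goto rendition of it. The function digit_count is a hypothetical example chosen because its body must run at least once (the number 0 still has one decimal digit), and the code assumes n is nonnegative.

```c
/* Hypothetical example: number of decimal digits in n, for n >= 0. */
long digit_count(long n)
{
    long count = 0;
    do {
        count++;
        n = n / 10;
    } while (n > 0);
    return count;
}

/* The same loop written to follow the do-while template. */
long digit_count_goto(long n)
{
    long count = 0;
    long t;
 loop:
    count++;            /* body-statement */
    n = n / 10;
    t = n > 0;          /* t = test-expr */
    if (t)
        goto loop;
    return count;
}
```

Both functions return 1 for n = 0 and 5 for n = 12345. The body executes before the first test in each version, which is the defining property of the do-while form.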
(a) C code

    long fact_do(long n)
    {
        long result = 1;
        do {
            result *= n;
            n = n-1;
        } while (n > 1);
        return result;
    }

(b) Equivalent goto version

    long fact_do_goto(long n)
    {
        long result = 1;
     loop:
        result *= n;
        n = n-1;
        if (n > 1)
            goto loop;
        return result;
    }

(c) Corresponding assembly-language code

    long fact_do(long n)
    n in %rdi
1   fact_do:
2     movl    $1, %eax       Set result = 1
3   .L2:                   loop:
4     imulq   %rdi, %rax     Compute result *= n
5     subq    $1, %rdi       Decrement n
6     cmpq    $1, %rdi       Compare n:1
7     jg      .L2            If >, goto loop
8     rep; ret               Return

Figure 3.19 Code for do-while version of factorial program. A conditional jump
causes the program to loop.


As an example, Figure 3.19(a) shows an implementation of a routine to com- 
pute the factorial of its argument, written n!, with a do-while loop. This function 
only computes the proper value for n > 0. 





A. What is the maximum value of n for which we can represent n! with a 32-bit 
int? 


B. What about for a 64-bit long? 


The goto code shown in Figure 3.19(b) shows how the loop gets turned into
a lower-level combination of tests and conditional jumps. Following the
initialization of result, the program begins looping. First it executes the body of
the loop, consisting here of updates to variables result and n. It then tests whether
n > 1, and, if so, it jumps back to the beginning of the loop. Figure 3.19(c) shows
the assembly code from which the goto code was generated. The conditional jump
instruction jg (line 7) is the key instruction in implementing a loop. It determines
whether to continue iterating or to exit the loop.

Reverse engineering assembly code, such as that of Figure 3.19(c), requires
determining which registers are used for which program values. In this case, the
mapping is fairly simple to determine: We know that n will be passed to the
function in register %rdi. We can see register %rax getting initialized to 1 (line
2). (Recall that, although the instruction has %eax as its destination, it will also
set the upper 4 bytes of %rax to 0.) We can see that this register is also updated
by multiplication on line 4. Furthermore, since %rax is used to return the function
value, it is often chosen to hold program values that are returned. We therefore
conclude that %rax corresponds to program value result.

Aside  Reverse engineering loops

A key to understanding how the generated assembly code relates to the original
source code is to find a mapping between program values and registers. This task was
simple enough for the loop of Figure 3.19, but it can be much more challenging for
more complex programs. The C compiler will often rearrange the computations, so that
some variables in the C code have no counterpart in the machine code, and new values
are introduced into the machine code that do not exist in the source code. Moreover,
it will often try to minimize register usage by mapping multiple program values onto
a single register.

The process we described for fact_do works as a general strategy for reverse
engineering loops. Look at how registers are initialized before the loop, updated
and tested within the loop, and used after the loop. Each of these provides a clue
that can be combined to solve a puzzle. Be prepared for surprising transformations,
some of which are clearly cases where the compiler was able to optimize the code,
and others where it is hard to explain why the compiler chose that particular
strategy.






For the C code


    long dw_loop(long x) {
        long y = x*x;
        long *p = &x;
        long n = 2*x;
        do {
            x += y;
            (*p)++;
            n--;
        } while (n > 0);
        return x;
    }


GCC generates the following assembly code:


      long dw_loop(long x)
      x initially in %rdi
    dw_loop:
      movq    %rdi, %rax
      movq    %rdi, %rcx
      imulq   %rdi, %rcx
      leaq    (%rdi,%rdi), %rdx
    .L2:
      leaq    1(%rcx,%rax), %rax
      subq    $1, %rdx
      testq   %rdx, %rdx
      jg      .L2
      rep; ret


A. Which registers are used to hold program values x, y, and n?


B. How has the compiler eliminated the need for pointer variable p and the
pointer dereferencing implied by the expression (*p)++?


C. Add annotations to the assembly code describing the operation of the
program, similar to those shown in Figure 3.19(c).


While Loops 
The general form of a while statement is as follows: 


    while (test-expr)
        body-statement


It differs from do-while in that test-expr is evaluated and the loop is potentially
terminated before the first execution of body-statement. There are a number of
ways to translate a while loop into machine code, two of which are used in code
generated by GCC. Both use the same loop structure as we saw for do-while loops
but differ in how to implement the initial test.

The first translation method, which we refer to as jump to middle, performs
the initial test by performing an unconditional jump to the test at the end of the
loop. It can be expressed by the following template for translating from the general
while loop form to goto code:


        goto test;
    loop:
        body-statement
    test:
        t = test-expr;
        if (t)
            goto loop;


As an example, Figure 3.20(a) shows an implementation of the factorial
function using a while loop. This function correctly computes 0! = 1. The adjacent
function fact_while_jm_goto (Figure 3.20(b)) is a C rendition of the assembly
code generated by GCC when optimization is specified with the command-line
option -Og. Comparing the goto code generated for fact_while (Figure 3.20(b)) to
that for fact_do (Figure 3.19(b)), we see that they are very similar, except that
the statement goto test before the loop causes the program to first perform the
test of n before modifying the values of result or n. The bottom portion of the
figure (Figure 3.20(c)) shows the actual assembly code generated.


(a) C code

    long fact_while(long n)
    {
        long result = 1;
        while (n > 1) {
            result *= n;
            n = n-1;
        }
        return result;
    }

(b) Equivalent goto version

    long fact_while_jm_goto(long n)
    {
        long result = 1;
        goto test;
     loop:
        result *= n;
        n = n-1;
     test:
        if (n > 1)
            goto loop;
        return result;
    }

(c) Corresponding assembly-language code

      long fact_while(long n)
      n in %rdi
    fact_while:
      movl    $1, %eax       Set result = 1
      jmp     .L5            Goto test
    .L6:                   loop:
      imulq   %rdi, %rax     Compute result *= n
      subq    $1, %rdi       Decrement n
    .L5:                   test:
      cmpq    $1, %rdi       Compare n:1
      jg      .L6            If >, goto loop
      rep; ret               Return

Figure 3.20 C and assembly code for while version of factorial using jump-to-middle
translation. The C function fact_while_jm_goto illustrates the operation of
the assembly-code version.


For C code having the general form


    long loop_while(long a, long b)
    {
        long result = ________;
        while (________) {
            result = ________;
            a = ________;
        }
        return result;
    }


GCC, run with command-line option -Og, produces the following code:


    long loop_while(long a, long b)
    a in %rdi, b in %rsi
1   loop_while:
2     movl    $1, %eax
3     jmp     .L2
4   .L3:
5     leaq    (%rdi,%rsi), %rdx
6     imulq   %rdx, %rax
7     addq    $1, %rdi
8   .L2:
9     cmpq    %rsi, %rdi
10    jl      .L3
11    rep; ret


We can see that the compiler used a jump-to-middle translation, using the jmp
instruction on line 3 to jump to the test starting with label .L2. Fill in the missing
parts of the C code.


The second translation method, which we refer to as guarded do, first trans-
forms the code into a do-while loop by using a conditional branch to skip over the
loop if the initial test fails. GCC follows this strategy when compiling with higher
levels of optimization, for example, with command-line option -O1. This method
can be expressed by the following template for translating from the general while
loop form to a do-while loop:


    t = test-expr;
    if (!t)
        goto done;
    do
        body-statement
        while (test-expr);
 done:


This, in turn, can be transformed into goto code as


t = test-expr; 
if (!t) 
goto done; 
 loop:
body-statement 
t = test-expr; 
if (t) 
goto loop; 
done: 


Using this implementation strategy, the compiler can often optimize the initial 
test, for example, determining that the test condition will always hold. 
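
To make the two templates concrete, here is a minimal sketch that writes one small loop three ways: as an ordinary while loop, as jump-to-middle goto code, and as guarded-do goto code. The function names (sum_while, sum_jm_goto, sum_gd_goto) are made up for this illustration and do not appear in the text; the code is a hand translation following the templates, not compiler output.

```c
/* Sum the integers from i up to n (inclusive), written three ways. */

/* Original while loop. */
long sum_while(long i, long n)
{
    long sum = 0;
    while (i <= n) {
        sum += i;
        i++;
    }
    return sum;
}

/* Jump-to-middle: jump straight to the test, which sits after the body. */
long sum_jm_goto(long i, long n)
{
    long sum = 0;
    goto test;
 loop:
    sum += i;
    i++;
 test:
    if (i <= n)
        goto loop;
    return sum;
}

/* Guarded do: test once up front, then run a do-while style loop. */
long sum_gd_goto(long i, long n)
{
    long sum = 0;
    if (!(i <= n))
        goto done;
 loop:
    sum += i;
    i++;
    if (i <= n)
        goto loop;
 done:
    return sum;
}
```

All three functions should return the same value for any arguments, including the case where the initial test fails (for example, i = 5 and n = 4) and the body never runs.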

As an example, Figure 3.21 shows the same C code for a factorial function
as in Figure 3.20, but demonstrates the compilation that occurs when GCC is
given command-line option -O1. Figure 3.21(c) shows the actual assembly code
generated, while Figure 3.21(b) renders this assembly code in a more readable C
representation. Referring to this goto code, we see that the loop will be skipped
if n ≤ 1, for the initial value of n. The loop itself has the same general structure
as that generated for the do-while version of the function (Figure 3.19). One
interesting feature, however, is that the loop test (line 9 of the assembly code)
has been changed from n > 1 in the original C code to n ≠ 1. The compiler has
determined that the loop can only be entered when n > 1, and that decrementing
n will result in either n > 1 or n = 1. Therefore, within the loop the test n ≠ 1
will be equivalent to the test n > 1.
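
The claim that the two loop tests are interchangeable here can be checked by brute force. The sketch below is an illustration only: fact_test_gt, fact_test_ne, and count_mismatches are hypothetical names, and the two factorial functions are hand-written guarded-do code that differ only in the final test.

```c
/* Guarded-do factorial with the loop test as written in the C source. */
long fact_test_gt(long n)
{
    long result = 1;
    if (n <= 1)
        goto done;
 loop:
    result *= n;
    n = n - 1;
    if (n > 1)          /* continue while n > 1 */
        goto loop;
 done:
    return result;
}

/* Same code, but with the test changed to n != 1. */
long fact_test_ne(long n)
{
    long result = 1;
    if (n <= 1)
        goto done;
 loop:
    result *= n;
    n = n - 1;
    if (n != 1)         /* safe only because the guard ensures n >= 1 here */
        goto loop;
 done:
    return result;
}

/* Count the arguments in [lo, hi] for which the two versions disagree. */
int count_mismatches(long lo, long hi)
{
    int bad = 0;
    long n;
    for (n = lo; n <= hi; n++)
        if (fact_test_gt(n) != fact_test_ne(n))
            bad++;
    return bad;
}
```

The range passed to count_mismatches should stay small (say, up to 12) so that the factorial does not overflow a long on any platform. Without the initial guard, the n != 1 version would loop past zero for n ≤ 0, which is why the rewrite depends on the guarded-do structure.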












For C code having the general form

long loop_while2(long a, long b)
{
    long result = ________;
    while (________) {
        result = ________;
        b = ________;
    }
    return result;
}


GCC, run with command-line option -O1, produces the following code:

a in %rdi, b in %rsi
1  loop_while2:
2    testq   %rsi, %rsi
3    jle     .L8
4    movq    %rsi, %rax
5  .L7:
6    imulq   %rdi, %rax
7    subq    %rdi, %rsi
8    testq   %rsi, %rsi
(a) C code

long fact_while(long n)
{
    long result = 1;
    while (n > 1) {
        result *= n;
        n = n-1;
    }
    return result;
}

(b) Equivalent goto version

long fact_while_gd_goto(long n)
{
    long result = 1;
    if (n <= 1)
        goto done;
 loop:
    result *= n;
    n = n-1;
    if (n != 1)
        goto loop;
 done:
    return result;
}

(c) Corresponding assembly-language code

long fact_while(long n)
n in %rdi
1  fact_while:
2    cmpq    $1, %rdi        Compare n:1
3    jle     .L7             If <=, goto done
4    movl    $1, %eax        Set result = 1
5  .L6:                    loop:
6    imulq   %rdi, %rax      Compute result *= n
7    subq    $1, %rdi        Decrement n
8    cmpq    $1, %rdi        Compare n:1
9    jne     .L6             If !=, goto loop
10   rep; ret                Return
11 .L7:                    done:
12   movl    $1, %eax        Compute result = 1
13   ret                     Return

Figure 3.21 C and assembly code for while version of factorial using guarded-
do translation. The fact_while_gd_goto function illustrates the operation of the
assembly-code version.


9    jg      .L7
10   rep; ret
11 .L8:
12   movq    %rsi, %rax
13   ret


We can see that the compiler used a guarded-do translation, using the jle 
instruction on line 3 to skip over the loop code when the initial test fails. Fill in 
the missing parts of the C code. Note that the control structure in the assembly 
code does not exactly match what would be obtained by a direct translation of the
C code according to our translation rules. In particular, it has two different ret 
instructions (lines 10 and 13). However, you can fill out the missing portions of 
the C code in a way that it will have equivalent behavior to the assembly code. 


Practice Problem 3.26 (solution page 336)
A function fun_a has the following overall structure:

long fun_a(unsigned long x) {
    long val = 0;
    while (...) {

    }
    return ...;
}

The GCC C compiler generates the following assembly code:


long fun_a(unsigned long x)
x in %rdi
fun_a:
  movl    $0, %eax
  jmp     .L5
.L6:
  xorq    %rdi, %rax
  shrq    %rdi            Shift right by 1
.L5:
  testq   %rdi, %rdi
  jne     .L6
  andl    $1, %eax
  ret

Reverse engineer the operation of this code and then do the following: 
A. Determine what loop translation method was used. 
B. Use the assembly-code version to fill in the missing parts of the C code. 
C. Describe in English what this function computes. 


For Loops 


The general form of a for loop is as follows:

for (init-expr; test-expr; update-expr)
    body-statement

The C language standard states (with one exception, highlighted in Problem 3.29)
that the behavior of such a loop is identical to the following code using a while
loop:

init-expr;
while (test-expr) {
    body-statement
    update-expr;
}

The program first evaluates the initialization expression init-expr. It enters a 
loop where it first evaluates the test condition test-expr, exiting if the test fails, then 
executes the body of the loop body-statement, and finally evaluates the update 
expression update-expr. 
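
As a small illustration of this equivalence, the sketch below writes the same computation once with a for loop and once with the corresponding while loop, with each piece of the template marked in a comment. The functions sum_squares_for and sum_squares_while are hypothetical examples chosen for this note, not code from the text.

```c
/* Sum of i*i for i = 1..n, written with a for loop. */
long sum_squares_for(long n)
{
    long sum = 0;
    long i;
    for (i = 1; i <= n; i++)
        sum += i * i;
    return sum;
}

/* The same function after applying the for-to-while rule. */
long sum_squares_while(long n)
{
    long sum = 0;
    long i;
    i = 1;                  /* init-expr */
    while (i <= n) {        /* test-expr */
        sum += i * i;       /* body-statement */
        i++;                /* update-expr */
    }
    return sum;
}
```

Both versions evaluate the test before the first iteration, so both return 0 when n is less than 1.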

The code generated by GCC for a for loop then follows one of our two trans-
lation strategies for while loops, depending on the optimization level. That is, the
jump-to-middle strategy yields the goto code


    init-expr;
    goto test;
 loop:
    body-statement
    update-expr;
 test:
    t = test-expr;
    if (t)
        goto loop;


while the guarded-do strategy yields 


init-expr ; 
t = test-expr; 
if (!t) 
goto done; 
loop: 
body-statement 
update-expr ; 
t = test-expr; 
if (t) 
goto loop; 
done: 


As examples, consider a factorial function written with a for loop: 


long fact_for(long n)
{
    long i;
    long result = 1;
    for (i = 2; i <= n; i++)
        result *= i;
    return result;
}

As shown, the natural way of writing a factorial function with a for loop is 
to multiply factors from 2 up to n, and so this function is quite different from the 
code we showed using either a while or a do-while loop. 

We can identify the different components of the for loop in this code as
follows:

init-expr        i = 2
test-expr        i <= n
update-expr      i++
body-statement   result *= i;
Substituting these components into the template we have shown to transform a 
for loop into a while loop yields the following: 


long fact_for_while(long n)
{
    long i = 2;
    long result = 1;
    while (i <= n) {
        result *= i;
        i++;
    }
    return result;
}


Applying the jump-to-middle transformation to the while loop then yields the 
following version in goto code: 


long fact_for_jm_goto(long n)
{
    long i = 2;
    long result = 1;
    goto test;
 loop:
    result *= i;
    i++;
 test:
    if (i <= n)
        goto loop;
    return result;
}
Indeed, a close examination of the assembly code produced by GCC with
command-line option -Og shows that it closely follows this template:


long fact_for(long n)
n in %rdi
fact_for:
  movl    $1, %eax        Set result = 1
  movl    $2, %edx        Set i = 2
  jmp     .L8             Goto test
.L9:                    loop:
  imulq   %rdx, %rax      Compute result *= i
  addq    $1, %rdx        Increment i
.L8:                    test:
  cmpq    %rdi, %rdx      Compare i:n
  jle     .L9             If <=, goto loop
  rep; ret                Return


Practice Problem 3.27 (solution page 336)
Write goto code for fact_for based on first transforming it to a while loop and 
then applying the guarded-do transformation. 








We see from this presentation that all three forms of loops in C (do-while,
while, and for) can be translated by a simple strategy, generating code that con-
tains one or more conditional branches. Conditional transfer of control provides
the basic mechanism for translating loops into machine code.





Practice Problem 3.28 (solution page 336)

A function fun_b has the following overall structure:





long fun_b(unsigned long x) {
    long val = 0;
    long i;
    for (...; ...; ...) {

    }
    return val;
}


The GCC C compiler generates the following assembly code:

long fun_b(unsigned long x)
x in %rdi
fun_b:
  movl    $64, %edx
  movl    $0, %eax
.L10:
  movq    %rdi, %rcx
  andl    $1, %ecx
  addq    %rax, %rax
  orq     %rcx, %rax
  shrq    %rdi            Shift right by 1
  subq    $1, %rdx
  jne     .L10
  rep; ret


Reverse engineer the operation of this code and then do the following: 


A. Use the assembly-code version to fill in the missing parts of the C code.


B. Explain why there is neither an initial test before the loop nor an initial jump 
to the test portion of the loop. 


C. Describe in English what this function computes. 


Practice Problem 3.29

Executing a continue statement in C causes the program to jump to the end of
the current loop iteration. The stated rule for translating a for loop into a while
loop needs some refinement when dealing with continue statements. For example,
consider the following code:
/* Example of for loop containing a continue statement */
/* Sum even numbers between 0 and 9 */
long sum = 0;
long i;
for (i = 0; i < 10; i++) {
    if (i & 1)
        continue;
    sum += i;
}


A. What would we get if we naively applied our rule for translating the for loop 
into a while loop? What would be wrong with this code? 


B. How could you replace the continue statement with a goto statement to
ensure that the while loop correctly duplicates the behavior of the for loop?


3.6.8 Switch Statements 


A switch statement provides a multiway branching capability based on the value
of an integer index. Switch statements are particularly useful when dealing with
tests where there can be a large number of possible outcomes. Not only do they
make the C code more readable, but they also allow an efficient implementation
using a data structure called a jump table. A jump table is an array where entry i
is the address of a code segment implementing the action the program should take
when the switch index equals i. The code performs an array reference into the
jump table using the switch index to determine the target for a jump instruction.
The advantage of using a jump table over a long sequence of if-else statements is
that the time taken to perform the switch is independent of the number of switch
cases. GCC selects the method of translating a switch statement based on the
number of cases and the sparsity of the case values. Jump tables are used when
there are a number of cases (e.g., four or more) and they span a small range of
values.
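
The idea of indexing an array to choose what code runs can be sketched in standard C with a table of function pointers. This is only an analogy, under the assumption that each case is packaged as its own function: a compiler-generated jump table holds addresses of code blocks inside a single function and uses an indirect jump rather than a call. The names op_neg, op_double, op_square, op_inc, op_default, and dispatch are invented for this sketch.

```c
#include <stddef.h>

/* One small function per "case". */
static long op_neg(long x)     { return -x; }
static long op_double(long x)  { return 2 * x; }
static long op_square(long x)  { return x * x; }
static long op_inc(long x)     { return x + 1; }
static long op_default(long x) { (void) x; return 0; }

/* Entry i is the code to run when the index equals i. */
static long (*const table[])(long) = {
    op_neg, op_double, op_square, op_inc
};

/* Select an action with one bounds check and one array reference,
   regardless of how many entries the table has. */
long dispatch(unsigned long index, long x)
{
    size_t entries = sizeof(table) / sizeof(table[0]);
    if (index >= entries)
        return op_default(x);
    return table[index](x);
}
```

The cost of dispatch does not grow with the number of table entries, which is the property the text attributes to jump tables when compared with a chain of if-else tests.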

Figure 3.22(a) shows an example of a C switch statement. This example has a
number of interesting features, including case labels that do not span a contiguous
range (there are no labels for cases 101 and 105), cases with multiple labels (cases
104 and 106), and cases that fall through to other cases (case 102) because the code
for the case does not end with a break statement.

Figure 3.23 shows the assembly code generated when compiling switch_eg.
The behavior of this code is shown in C as the procedure switch_eg_impl in
Figure 3.22(b). This code makes use of support provided by GCC for jump tables,
as an extension to the C language. The array jt contains seven entries, each of
which is the address of a block of code. These locations are defined by labels in
the code and indicated in the entries in jt by code pointers, consisting of the labels
prefixed by &&. (Recall that the operator '&' creates a pointer for a data value. In
making this extension, the authors of GCC created a new operator && to create
a pointer for a code location.) We recommend that you study the C procedure
switch_eg_impl and how it relates to the assembly-code version.

Our original C code has cases for values 100, 102-104, and 106, but the switch
variable n can be an arbitrary integer. The compiler first shifts the range to between
0 and 6 by subtracting 100 from n, creating a new program variable that we call
index in our C version. It further simplifies the branching possibilities by treating
index as an unsigned value, making use of the fact that negative numbers in a
two's-complement representation map to large positive numbers in an unsigned
representation. It can therefore test whether index is outside of the range 0-6
by testing whether it is greater than 6. In the C and assembly code, there are
five distinct locations to jump to, based on the value of index. These are loc_A
(identified in the assembly code as .L3), loc_B (.L5), loc_C (.L6), loc_D (.L7),
and loc_def (.L8), where the latter is the destination for the default case. Each
of these labels identifies a block of code implementing one of the case branches.
In both the C and the assembly code, the program compares index to 6 and jumps
to the code for the default case if it is greater.
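
The single-comparison range test can be checked in isolation. The following sketch compares the obvious two-sided test against the shifted unsigned test; the function names out_of_range_signed and out_of_range_unsigned are hypothetical. The subtraction is written in unsigned arithmetic so that it wraps around rather than overflowing a signed value.

```c
/* Two comparisons on the signed value. */
int out_of_range_signed(long n)
{
    return n < 100 || n > 106;
}

/* One comparison after shifting the range to start at 0 and treating
   the result as unsigned: values below 100 wrap to very large numbers. */
int out_of_range_unsigned(long n)
{
    unsigned long index = (unsigned long) n - 100UL;
    return index > 6;
}
```

For n = 99 the unsigned index is the largest representable unsigned long, so the single test index > 6 rejects it just as n < 100 would.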

The key step in executing a switch statement is to access a code location
through the jump table. This occurs in line 16 in the C code, with a goto statement
that references the jump table jt. This computed goto is supported by GCC as an
extension to the C language. In our assembly-code version, a similar operation
occurs on line 5, where the jmp instruction's operand is prefixed with '*', indicating
(a) Switch statement

void switch_eg(long x, long n,
               long *dest)
{
    long val = x;

    switch (n) {

    case 100:
        val *= 13;
        break;

    case 102:
        val += 10;
        /* Fall through */

    case 103:
        val += 11;
        break;

    case 104:
    case 106:
        val *= val;
        break;

    default:
        val = 0;
    }
    *dest = val;
}


(b) Translation into extended C

1    void switch_eg_impl(long x, long n,
2                        long *dest)
3    {
4        /* Table of code pointers */
5        static void *jt[7] = {
6            &&loc_A, &&loc_def, &&loc_B,
7            &&loc_C, &&loc_D, &&loc_def,
8            &&loc_D
9        };
10       unsigned long index = n - 100;
11       long val;
12
13       if (index > 6)
14           goto loc_def;
15       /* Multiway branch */
16       goto *jt[index];
17
18    loc_A:    /* Case 100 */
19       val = x * 13;
20       goto done;
21    loc_B:    /* Case 102 */
22       x = x + 10;
23       /* Fall through */
24    loc_C:    /* Case 103 */
25       val = x + 11;
26       goto done;
27    loc_D:    /* Cases 104, 106 */
28       val = x * x;
29       goto done;
30    loc_def:  /* Default case */
31       val = 0;
32    done:
33       *dest = val;
34   }

Figure 3.22 Example switch statement and its translation into extended C. The translation shows the
structure of jump table jt and how it is accessed. Such tables are supported by GCC as an extension to the C
language.


an indirect jump, and the operand specifies a memory location indexed by register
%rsi, which holds the value of index. (We will see in Section 3.8 how array
references are translated into machine code.)

Our C code declares the jump table as an array of seven elements, each
of which is a pointer to a code location. These elements span values 0-6 of
void switch_eg(long x, long n, long *dest)
x in %rdi, n in %rsi, dest in %rdx
switch_eg:
  subq    $100, %rsi              Compute index = n-100
  cmpq    $6, %rsi                Compare index:6
  ja      .L8                     If >, goto loc_def
  jmp     *.L4(,%rsi,8)           Goto *jt[index]
.L3:                            loc_A:
  leaq    (%rdi,%rdi,2), %rax     3*x
  leaq    (%rdi,%rax,4), %rdi     val = 13*x
  jmp     .L2                     Goto done
.L5:                            loc_B:
  addq    $10, %rdi               x = x + 10
.L6:                            loc_C:
  addq    $11, %rdi               val = x + 11
  jmp     .L2                     Goto done
.L7:                            loc_D:
  imulq   %rdi, %rdi              val = x * x
  jmp     .L2                     Goto done
.L8:                            loc_def:
  movl    $0, %edi                val = 0
.L2:                            done:
  movq    %rdi, (%rdx)            *dest = val
  ret                             Return

Figure 3.23 Assembly code for switch statement example in Figure 3.22.
index, corresponding to values 100-106 of n. Observe that the jump table handles
duplicate cases by simply having the same code label (loc_D) for entries 4 and 6,
and it handles missing cases by using the label for the default case (loc_def) as
entries 1 and 5.

In the assembly code, the jump table is indicated by the following declarations,
to which we have added comments:

  .section    .rodata
  .align 8                Align address to multiple of 8
.L4:
  .quad   .L3             Case 100: loc_A
  .quad   .L8             Case 101: loc_def
  .quad   .L5             Case 102: loc_B
  .quad   .L6             Case 103: loc_C
  .quad   .L7             Case 104: loc_D
  .quad   .L8             Case 105: loc_def
  .quad   .L7             Case 106: loc_D
These declarations state that within the segment of the object-code file called
.rodata (for "read-only data"), there should be a sequence of seven "quad" (8-
byte) words, where the value of each word is given by the instruction address
associated with the indicated assembly-code labels (e.g., .L3). Label .L4 marks
the start of this allocation. The address associated with this label serves as the
base for the indirect jump (line 5).

The different code blocks (C labels loc_A through loc_D and loc_def) im-
plement the different branches of the switch statement. Most of them simply
compute a value for val and then go to the end of the function. Similarly, the
assembly-code blocks compute a value for register %rdi and jump to the position
indicated by label .L2 at the end of the function. Only the code for case label 102
does not follow this pattern, to account for the way the code for this case falls
through to the block with label 103 in the original C code. This is handled in the
assembly-code block starting with label .L5, by omitting the jmp instruction at
the end of the block, so that the code continues execution of the next block. Simi-
larly, the C version switch_eg_impl has no goto statement at the end of the block
starting with label loc_B.

Examining all of this code requires careful study, but the key point is to see 
that the use of a jump table allows a very efficient way to implement a multiway 
branch. In our case, the program could branch to five distinct locations with a 
single jump table reference. Even if we had a switch statement with hundreds of 
cases, they could be handled by a single jump table access. 





In the C function that follows, we have omitted the body of the switch statement.
In the C code, the case labels did not span a contiguous range, and some cases had
multiple labels.


void switch2(long x, long *dest) {
    long val = 0;
    switch (x) {
        Body of switch statement omitted
    }
    *dest = val;
}


In compiling the function, GCC generates the assembly code that follows for
the initial part of the procedure, with variable x in %rdi:

void switch2(long x, long *dest)
x in %rdi
switch2:
  addq    $1, %rdi
  cmpq    $8, %rdi
  ja      .L2
  jmp     *.L4(,%rdi,8)
It generates the following code for the jump table:

.L4:
  .quad   .L9
  .quad   .L5
  .quad   .L6
  .quad   .L7
  .quad   .L2
  .quad   .L7
  .quad   .L8
  .quad   .L2
  .quad   .L5


Based on this information, answer the following questions: 


A. What were the values of the case labels in the switch statement? 
B. What cases had multiple labels in the C code? 


Practice Problem 3.31

For a C function switcher with the general structure

void switcher(long a, long b, long c, long *dest)
{
    long val;
    switch(a) {
    case ________:        /* Case A */
        c = ________;
        /* Fall through */
    case ________:        /* Case B */
        val = ________;
        break;
    case ________:        /* Case C */
    case ________:        /* Case D */
        val = ________;
        break;
    case ________:        /* Case E */
        val = ________;
        break;
    default:
        val = ________;
    }
    *dest = val;
}

Gcc generates the assembly code and jump table shown in Figure 3.24. 
Fill in the missing parts of the C code. Except for the ordering of case labels 
C and D, there is only one way to fit the different cases into the template. 
(a) Code

void switcher(long a, long b, long c, long *dest)
a in %rdi, b in %rsi, c in %rdx, dest in %rcx
switcher:
  cmpq    $7, %rdi
  ja      .L2
  jmp     *.L4(,%rdi,8)
  .section    .rodata
.L7:
  xorq    $15, %rsi
  movq    %rsi, %rdx
.L3:
  leaq    112(%rdx), %rdi
  jmp     .L6
.L5:
  leaq    (%rdx,%rsi), %rdi
  salq    $2, %rdi
  jmp     .L6
.L2:
  movq    %rsi, %rdi
.L6:
  movq    %rdi, (%rcx)
  ret

(b) Jump table

.L4:
  .quad   .L3
  .quad   .L2
  .quad   .L5
  .quad   .L2
  .quad   .L6
  .quad   .L7
  .quad   .L2
  .quad   .L5

Figure 3.24 Assembly code and jump table for Problem 3.31.


3.7 Procedures 


Procedures are a key abstraction in software. They provide a way to package code
that implements some functionality with a designated set of arguments and an
optional return value. This function can then be invoked from different points in
a program. Well-designed software uses procedures as an abstraction mechanism,
hiding the detailed implementation of some action while providing a clear and
concise interface definition of what values will be computed and what effects
the procedure will have on the program state. Procedures come in many guises
in different programming languages (functions, methods, subroutines, handlers,
and so on), but they all share a general set of features.

There are many different attributes that must be handled when providing 
machine-level support for procedures. For discussion purposes, suppose proce- 
dure P calls procedure Q, and Q then executes and returns back to P. These actions 
involve one or more of the following mechanisms: 


Passing control. The program counter must be set to the starting address of the 
code for Q upon entry and then set to the instruction in P following the 
call to Q upon return. 


Passing data. P must be able to provide one or more parameters to Q, and Q must 
be able to return a value back to P. 


Allocating and deallocating memory. Q may need to allocate space for local 
variables when it begins and then free that storage before it returns. 


The x86-64 implementation of procedures involves a combination of special 
instructions and a set of conventions on how to use the machine resources, such as 
the registers and the program memory. Great effort has been made to minimize 
the overhead involved in invoking a procedure. As a consequence, it follows what 
can be seen as a minimalist strategy, implementing only as much of the above set 
of mechanisms as is required for each particular procedure. In our presentation, 
we build up the different mechanisms step by step, first describing control, then 


data passing, and, finally, memory management. 


3.7.1 The Run-Time Stack


A key feature of the procedure-calling mechanism of C, and of most other lan- 
guages, is that it can make use of the last-in, first-out memory management disci- 
pline provided by a stack data structure. Using our example of procedure P calling 
procedure Q, we can see that while Q is executing, P, along with any of the proce- 
dures in the chain of calls up to P, is temporarily suspended. While Q is running, 
only it will need the ability to allocate new storage for its local variables or to set up 
a call to another procedure. On the other hand, when Q returns, any local storage it
has allocated can be freed. Therefore, a program can manage the storage required 
by its procedures using a stack, where the stack and the program registers store 
the information required for passing control and data, and for allocating memory. 
As P calls Q, control and data information are added to the end of the stack. This 
information gets deallocated when P returns. 

As described in Section 3.4.4, the x86-64 stack grows toward lower addresses 
and the stack pointer %rsp points to the top element of the stack. Data can be 
stored on and retrieved from the stack using the pushq and popq instructions. 
Space for data with no specified initial value can be allocated on the stack by simply 
decrementing the stack pointer by an appropriate amount. Similarly, space can be 
deallocated by incrementing the stack pointer. 
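
The stack-pointer arithmetic described here can be modeled with a small array. The sketch below is a toy model, not how the hardware is implemented: struct stack_model and the functions stack_init, push, pop, reserve, and release are hypothetical names, the stack pointer is an array index rather than an address, and there are no overflow or underflow checks.

```c
#include <stdint.h>

#define STACK_WORDS 16

struct stack_model {
    uint64_t mem[STACK_WORDS];  /* mem[STACK_WORDS-1] is the highest address */
    int sp;                     /* index of the top element; STACK_WORDS when empty */
};

void stack_init(struct stack_model *s)
{
    s->sp = STACK_WORDS;
}

/* Like pushq: decrement the stack pointer, then store the value. */
void push(struct stack_model *s, uint64_t v)
{
    s->sp -= 1;
    s->mem[s->sp] = v;
}

/* Like popq: load the top value, then increment the stack pointer. */
uint64_t pop(struct stack_model *s)
{
    uint64_t v = s->mem[s->sp];
    s->sp += 1;
    return v;
}

/* Allocate n uninitialized words by moving the stack pointer down. */
void reserve(struct stack_model *s, int n)
{
    s->sp -= n;
}

/* Deallocate n words by moving the stack pointer back up. */
void release(struct stack_model *s, int n)
{
    s->sp += n;
}
```

In this model, growth toward lower addresses shows up as the index sp decreasing on every push or reserve, and the most recently pushed value is always at mem[sp].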

When an x86-64 procedure requires storage beyond what it can hold in reg-
isters, it allocates space on the stack. This region is referred to as the procedure's
stack frame. Figure 3.25 shows the overall structure of the run-time stack, includ-
ing its partitioning into stack frames, in its most general form. The frame for the
currently executing procedure is always at the top of the stack. When procedure P
calls procedure Q, it will push the return address onto the stack, indicating where
within P the program should resume execution once Q returns. We consider the
return address to be part of P's stack frame, since it holds state relevant to P. The
code for Q allocates the space required for its stack frame by extending the cur-
rent stack boundary. Within that space, it can save the values of registers, allocate
space for local variables, and set up arguments for the procedures it calls. The
stack frames for most procedures are of fixed size, allocated at the beginning of
the procedure. Some procedures, however, require variable-size frames. This issue
is discussed in Section 3.10.5. Procedure P can pass up to six integral values (i.e.,
pointers and integers) in registers, but if Q requires more arguments, these can
be stored by P within its stack frame prior to the call.

Figure 3.25 General stack frame structure. The stack can be used for passing
arguments, for storing return information, for saving registers, and for local
storage. Portions may be omitted when not needed.
[Diagram labels, from stack "bottom" to stack "top": frame for calling function P
(including Argument 7); frame for executing function Q (saved registers, local
variables, argument build area); stack pointer %rsp at the stack "top".]

In the interest of space and time efficiency, x86-64 procedures allocate only 
the portions of stack frames they require. For example, many procedures have 
six or fewer arguments, and so all of their parameters can be passed in registers. 
Thus, parts of the stack frame diagrammed in Figure 3.25 may be omitted. Indeed, 
many functions do not even require a stack frame. This occurs when all of the local 
variables can be held in registers and the function does not call any other functions 
(sometimes referred to as a leaf procedure, in reference to the tree structure of 
procedure calls). For example, none of the functions we have examined thus far 
required stack frames. 


3.7.2 Control Transfer 


Passing control from function P to function Q involves simply setting the program 
counter (PC) to the starting address of the code for Q. However, when it later 
comes time for Q to return, the processor must have some record of the code 
location where it should resume the execution of P. This information is recorded 
in x86-64 machines by invoking procedure Q with the instruction call Q. This 
instruction pushes an address A onto the stack and sets the PC to the beginning 
of Q. The pushed address A is referred to as the return address and is computed 
as the address of the instruction immediately following the call instruction. The
counterpart instruction ret pops an address A off the stack and sets the PC to A. 

The general forms of the call and ret instructions are described as follows:

Instruction       Description
call Label        Procedure call
call *Operand     Procedure call
ret               Return from call


(These instructions are referred to as callq and retq in the disassembly outputs
generated by the program OBJDUMP. The added suffix 'q' simply emphasizes that
these are x86-64 versions of call and return instructions, not IA32. In x86-64
assembly code, both versions can be used interchangeably.)

The call instruction has a target indicating the address of the instruction 
where the called procedure starts. Like jumps, a call can be either direct or indirect. 
In assembly code, the target of a direct call is given as a label, while the target of 
an indirect call is given by '*' followed by an operand specifier using one of the 
formats described in Figure 3.3. 





          (a) Executing call    (b) After call     (c) After ret
%rip      0x400563              0x400540           0x400568
%rsp      0x7fffffffe840        0x7fffffffe838     0x7fffffffe840

Figure 3.26 Illustration of call and ret functions. The call instruction transfers control to the start of a
function, while the ret instruction returns back to the instruction following the call.


Figure 3.26 illustrates the execution of the call and ret instructions for the 
multstore and main functions introduced in Section 3.2.2. The following are 
excerpts of the disassembled code for the two functions: 


Beginning of function multstore
0000000000400540 <multstore>:
  400540: 53                      push   %rbx
  400541: 48 89 d3                mov    %rdx,%rbx

Return from function multstore
  40054d: c3                      retq

Call to multstore from main
  400563: e8 d8 ff ff ff          callq  400540 <multstore>
  400568: 48 8b 54 24 08          mov    0x8(%rsp),%rdx

In this code, we can see that the call instruction with address 0x400563 in
main calls function multstore. This status is shown in Figure 3.26(a), with the
indicated values for the stack pointer %rsp and the program counter %rip. The
effect of the call is to push the return address 0x400568 onto the stack and to jump
to the first instruction in function multstore, at address 0x400540 (3.26(b)).
The execution of function multstore continues until it hits the ret instruction
at address 0x40054d. This instruction pops the value 0x400568 from the stack
and jumps to this address, resuming the execution of main just after the call
instruction (3.26(c)).
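
The effect of call and ret on the program counter and stack pointer can be traced with a small model that uses the addresses from the multstore example. This is a sketch under simplifying assumptions: struct machine, cell, do_call, and do_ret are hypothetical names, only a few stack slots below the starting stack pointer are modeled, and there is no checking for an empty or full stack.

```c
#include <stdint.h>

#define STACK_BASE 0x7fffffffe840ULL  /* %rsp before the call in the example */
#define SLOTS 8

struct machine {
    uint64_t rip;           /* program counter */
    uint64_t rsp;           /* stack pointer, as a simulated address */
    uint64_t slot[SLOTS];   /* slot[k] models the 8 bytes at STACK_BASE - 8*(k+1) */
};

/* Map a simulated stack address below STACK_BASE to its storage slot. */
static uint64_t *cell(struct machine *m, uint64_t addr)
{
    return &m->slot[(STACK_BASE - addr) / 8 - 1];
}

/* call: push the address of the following instruction, then jump. */
void do_call(struct machine *m, uint64_t target, uint64_t insn_len)
{
    uint64_t ret_addr = m->rip + insn_len;
    m->rsp -= 8;
    *cell(m, m->rsp) = ret_addr;
    m->rip = target;
}

/* ret: pop the return address and jump to it. */
void do_ret(struct machine *m)
{
    m->rip = *cell(m, m->rsp);
    m->rsp += 8;
}
```

Starting with rip = 0x400563 and rsp = 0x7fffffffe840, calling do_call with target 0x400540 and an instruction length of 5 bytes (the callq encoding shown above) leaves 0x400568 on the modeled stack, and a later do_ret restores both registers to the values in part (c) of the figure.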

As a more detailed example of passing control to and from procedures, Figure
3.27(a) shows the disassembled code for two functions, top and leaf, as well as
the portion of code in function main where top gets called. Each instruction is
identified by labels L1-L2 (in leaf), T1-T4 (in top), and M1-M2 (in main). Part (b)
of the figure shows a detailed trace of the code execution, in which main calls
top(100), causing top to call leaf(95). Function leaf returns 97 to top, which
(a) Disassembled code for demonstrating procedure calls and returns

Disassembly of leaf(long y)
y in %rdi
0000000000400540 <leaf>:
  400540: 48 8d 47 02       lea    0x2(%rdi),%rax    L1: y+2
  400544: c3                retq                     L2: Return

Disassembly of top(long x)
x in %rdi
0000000000400545 <top>:
  400545: 48 83 ef 05       sub    $0x5,%rdi         T1: x-5
  400549: e8 f2 ff ff ff    callq  400540 <leaf>     T2: Call leaf(x-5)
  40054e: 48 01 c0          add    %rax,%rax         T3: Double result
  400551: c3                retq                     T4: Return

Call to top from function main
  40055b: e8 e5 ff ff ff    callq  400545 <top>      M1: Call top(100)
  400560: 48 89 c2          mov    %rax,%rdx         M2: Resume

(b) Execution trace of example code

        Instruction                 State values (at beginning)
Label  PC        Instruction  %rdi  %rax  %rsp            *%rsp     Description
M1     0x40055b  callq        100   -     0x7fffffffe820  -         Call top(100)
T1     0x400545  sub          100   -     0x7fffffffe818  0x400560  Entry of top
T2     0x400549  callq        95    -     0x7fffffffe818  0x400560  Call leaf(95)
L1     0x400540  lea          95    -     0x7fffffffe810  0x40054e  Entry of leaf
L2     0x400544  retq         -     97    0x7fffffffe810  0x40054e  Return 97 from leaf
T3     0x40054e  add          -     97    0x7fffffffe818  0x400560  Resume top
T4     0x400551  retq         -     194   0x7fffffffe818  0x400560  Return 194 from top
M2     0x400560  mov          -     194   0x7fffffffe820  -         Resume main

Figure 3.27 Detailed execution of program involving procedure calls and returns. Using the stack to
store return addresses makes it possible to return to the right point in the procedures.


then returns 194 to main. The first three columns describe the instruction being
executed, including the instruction label, the address, and the instruction type. The
next four columns show the state of the program before the instruction is executed,
including the contents of registers %rdi, %rax, and %rsp, as well as the value at
the top of the stack. The contents of this table should be studied carefully, as they
demonstrate the important role of the run-time stack in managing the storage
needed to support procedure calls and returns.

Instruction L1 of leaf sets %rax to 97, the value to be returned. Instruction L2
then returns. It pops 0x40054e from the stack. In setting the PC to this popped
value, control transfers back to instruction T3 of top. The program has successfully
completed the call to leaf and returned to top.

Instruction T3 sets %rax to 194, the value to be returned from top. Instruction
T4 then returns. It pops 0x400560 from the stack, thereby setting the PC to
instruction M2 of main. The program has successfully completed the call to top
and returned to main. We see that the stack pointer has also been restored to
0x7fffffffe820, the value it had before the call to top.

We can see that this simple mechanism of pushing the return address onto
the stack makes it possible for the function to later return to the proper point
in the program. The standard call/return mechanism of C (and of most programming
languages) conveniently matches the last-in, first-out memory management
discipline provided by a stack.
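
The disassembly in Figure 3.27(a) is consistent with C source along the following lines. This is a reconstruction from the annotations in the figure, not the original source, and trace_example is a hypothetical helper added only to replay the call made at label M1.

```c
#include <assert.h>
#include <stdio.h>

/* Reconstructed from the figure annotations: leaf returns its argument plus 2. */
long leaf(long y)
{
    return y + 2;
}

/* top computes x-5, passes it to leaf, and doubles the result. */
long top(long x)
{
    return 2 * leaf(x - 5);
}

/* Hypothetical helper: performs the same call that main makes at M1. */
long trace_example(void)
{
    long r = top(100);      /* leaf(95) returns 97, so top returns 194 */
    assert(r == 194);
    printf("top(100) = %ld\n", r);
    return r;
}
```

Calling trace_example() from a main function reproduces the values in the %rax column of the trace: 97 on return from leaf and 194 on return from top.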























with the code for a call of first by function main:






Disassembly of last(long u, long v)
u in %rdi, v in %rsi
1   0000000000400540 <last>:
2     400540: 48 89 f8          mov    %rdi,%rax          L1: u
3     400543: 48 0f af c6       imul   %rsi,%rax          L2: u*v
4     400547: c3                retq                      L3: Return

Disassembly of first(long x)
x in %rdi
5   0000000000400548 <first>:
6     400548: 48 8d 77 01       lea    0x1(%rdi),%rsi     F1: x+1
7     40054c: 48 83 ef 01       sub    $0x1,%rdi          F2: x-1
8     400550: e8 eb ff ff ff    callq  400540 <last>      F3: Call last(x-1,x+1)
9     400555: f3 c3             repz retq                 F4: Return

10    400560: e8 e3 ff ff ff    callq  400548 <first>     M1: Call first(10)
11    400565: 48 89 c2          mov    %rax,%rdx          M2: Resume











Each of these instructions is given a label, similar to those in Figure 3.27(a).
Starting with the calling of first(10) by main, fill in the following table to trace
instruction execution through to the point where the program returns back to
main.

       Instruction                  State values (at beginning)
Label  PC        Instruction  %rdi  %rsi  %rax  %rsp            *%rsp  Description
M1     0x400560  callq        10    -     -     0x7fffffffe820  -      Call first(10)


3.7.3 Data Transfer 


In addition to passing control to a procedure when called, and then back again
when the procedure returns, procedure calls may involve passing data as arguments,
and returning from a procedure may also involve returning a value. With
x86-64, most of this data passing to and from procedures takes place via registers.
For example, we have already seen numerous examples of functions where
arguments are passed in registers %rdi, %rsi, and others, and where values are
returned in register %rax. When procedure P calls procedure Q, the code for P must
first copy the arguments into the proper registers. Similarly, when Q returns back
to P, the code for P can access the returned value in register %rax. In this section,
we explore these conventions in greater detail.

With x86-64, up to six integral (i.e., integer and pointer) arguments can be 
passed via registers. The registers are used in a specified order, with the name 
used for a register depending on the size of the data type being passed. These are 
shown in Figure 3.28. Arguments are allocated to these registers according to their 


Operand                       Argument number
size (bits)    1       2       3       4       5       6
64             %rdi    %rsi    %rdx    %rcx    %r8     %r9
32             %edi    %esi    %edx    %ecx    %r8d    %r9d
16             %di     %si     %dx     %cx     %r8w    %r9w
8              %dil    %sil    %dl     %cl     %r8b    %r9b







Figure 3.28 Registers for passing function arguments. The registers are used in a 
specified order and named according to the argument sizes. 


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ordering in the argument list. Arguments smaller than 64 bits can be accessed using 
the appropriate subsection of the 64-bit register. For example, if the first argument 
is 32 bits, it can be accessed as %edi. 

When a function has more than six integral arguments, the other ones are
passed on the stack. Assume that procedure P calls procedure Q with n integral
arguments, such that n > 6. Then the code for P must allocate a stack frame with
enough storage for arguments 7 through n, as illustrated in Figure 3.25. It copies
arguments 1-6 into the appropriate registers, and it puts arguments 7 through n
onto the stack, with argument 7 at the top of the stack. When passing parameters
on the stack, all data sizes are rounded up to be multiples of eight. With the
arguments in place, the program can then execute a call instruction to transfer
control to procedure Q. Procedure Q can access its arguments via registers and
possibly from the stack. If Q, in turn, calls some function that has more than six
arguments, it can allocate space within its stack frame for these, as is illustrated
by the area labeled "Argument build area" in Figure 3.25.
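
As a small illustration of the rule above, consider a function with eight integral arguments. The names sum8 and call_sum8 are hypothetical and do not appear in the text. Under the convention just described, and assuming the compiler does not inline the call, the first six arguments would travel in registers and the last two on the stack.

```c
#include <assert.h>

/* Hypothetical function with eight integral arguments.
 * Per the convention above, a1..a6 would arrive in
 * %rdi, %rsi, %rdx, %rcx, %r8, and %r9; a7 and a8 would be on the stack. */
long sum8(long a1, long a2, long a3, long a4,
          long a5, long a6, long a7, long a8)
{
    return a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8;
}

/* Hypothetical caller: before the call it must store arguments 7 and 8
 * in its own stack frame, with argument 7 at the top of the stack. */
long call_sum8(void)
{
    long r = sum8(1, 2, 3, 4, 5, 6, 7, 8);
    assert(r == 36);
    return r;
}
```

Compiling this with optimization disabled and inspecting the assembly is one way to see the two stack stores that precede the call instruction.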

As an example of argument passing, consider the C function proc shown in
Figure 3.29(a). This function has eight arguments, including integers with different
numbers of bytes (8, 4, 2, and 1), as well as different types of pointers, each of which
is 8 bytes.

The assembly code generated for proc is shown in Figure 3.29(b). The first
six arguments are passed in registers. The last two are passed on the stack, as
documented by the diagram of Figure 3.30. This diagram shows the state of the
stack during the execution of proc. We can see that the return address was pushed
onto the stack as part of the procedure call. The two arguments, therefore, are
at offsets 8 and 16 relative to the stack pointer. Within the code, we can see
that different versions of the ADD instruction are used according to the sizes of the
operands: addq for a1 (long), addl for a2 (int), addw for a3 (short), and addb for
a4 (char). Observe that the movl instruction of line 6 reads 4 bytes from memory;
the following addb instruction only makes use of the low-order byte.





The function has the following body: 


*u += a; 
*v += b; 
return sizeof(a) * sizeof(b); 


It compiles to the following x86-64 code: 


procprob:
  movslq  %edi, %rdi
  addq    %rdi, (%rdx)
  addb    %sil, (%rcx)
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(a) C code 


void proc(long a1, long *a1p,
          int a2, int *a2p,
          short a3, short *a3p,
          char a4, char *a4p)
{
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    *a4p += a4;
}


(b) Generated assembly code 


void proc(a1, a1p, a2, a2p, a3, a3p, a4, a4p)
Arguments passed as follows:
  a1  in %rdi     (64 bits)
  a1p in %rsi     (64 bits)
  a2  in %edx     (32 bits)
  a2p in %rcx     (64 bits)
  a3  in %r8w     (16 bits)
  a3p in %r9      (64 bits)
  a4  at %rsp+8   ( 8 bits)
  a4p at %rsp+16  (64 bits)
1  proc:
2    movq   16(%rsp), %rax     Fetch a4p   (64 bits)
3    addq   %rdi, (%rsi)       *a1p += a1  (64 bits)
4    addl   %edx, (%rcx)       *a2p += a2  (32 bits)
5    addw   %r8w, (%r9)        *a3p += a3  (16 bits)
6    movl   8(%rsp), %edx      Fetch a4    ( 8 bits)
7    addb   %dl, (%rax)        *a4p += a4  ( 8 bits)
8    ret                       Return


Figure 3.29 Example of function with multiple arguments of different types.
Arguments 1-6 are passed in registers, while arguments 7-8 are passed on the stack.


Figure 3.30 Stack frame structure for function proc. Arguments a4 and a4p are
passed on the stack.

    Offset  Contents
    16      a4p
     8      a4
     0      Return address    <- Stack pointer %rsp

  movl    $6, %eax
  ret


Determine a valid ordering and types of the four parameters. There are two 
correct answers. 


3.7.4 Local Storage on the Stack


Most of the procedure examples we have seen so far did not require any local 
storage beyond what could be held in registers. At times, however, local data must 
be stored in memory. Common cases of this include these: 


* There are not enough registers to hold all of the local data.

* The address operator '&' is applied to a local variable, and hence we must be
  able to generate an address for it.

* Some of the local variables are arrays or structures and hence must be accessed
  by array or structure references. We will discuss this possibility when we
  describe how arrays and structures are allocated.


Typically, a procedure allocates space on the stack frame by decrementing the 
stack pointer. This results in the portion of the stack frame labeled “Local vari- 
ables” in Figure 3.25. 

As an example of the handling of the address operator, consider the two
functions shown in Figure 3.31(a). The function swap_add swaps the two values
designated by pointers xp and yp and also returns the sum of the two values. The
function caller creates pointers to local variables arg1 and arg2 and passes these
to swap_add. Figure 3.31(b) shows how caller uses a stack frame to implement
these local variables. The code for caller starts by decrementing the stack pointer
by 16; this effectively allocates 16 bytes on the stack. Letting S denote the value of
the stack pointer, we can see that the code computes &arg2 as S + 8 (line 5) and &arg1
as S (line 6). We can therefore infer that local variables arg1 and arg2 are stored
within the stack frame at offsets 0 and 8 relative to the stack pointer. When the call
to swap_add completes, the code for caller then retrieves the two values from
the stack (lines 8-9), computes their difference, and multiplies this by the value
returned by swap_add in register %rax (line 10). Finally, the function deallocates
its stack frame by incrementing the stack pointer by 16 (line 11). We can see with
this example that the run-time stack provides a simple mechanism for allocating
local storage when it is required and deallocating it when the function completes.
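
The arithmetic performed by caller can be checked directly. The sketch below repeats the two functions of Figure 3.31(a) and adds a hypothetical helper, check_caller, that is not part of the original figure. After the swap, arg1 is 1057 and arg2 is 534, so sum is 1591, diff is 523, and the result is 1591 * 523.

```c
#include <assert.h>
#include <stdio.h>

/* As in Figure 3.31(a): swap the two values and return their sum. */
long swap_add(long *xp, long *yp)
{
    long x = *xp;
    long y = *yp;
    *xp = y;
    *yp = x;
    return x + y;
}

/* As in Figure 3.31(a): taking &arg1 and &arg2 requires both
 * variables to have addresses, hence storage in the stack frame. */
long caller(void)
{
    long arg1 = 534;
    long arg2 = 1057;
    long sum = swap_add(&arg1, &arg2);
    long diff = arg1 - arg2;        /* 1057 - 534 after the swap */
    return sum * diff;
}

/* Hypothetical helper: confirms the value worked out by hand. */
long check_caller(void)
{
    long r = caller();
    assert(r == 832093);            /* 1591 * 523 */
    printf("caller() = %ld\n", r);
    return r;
}
```

An optimizing compiler may inline swap_add and keep both values in registers; the stack frame shown in Figure 3.31(b) reflects the particular code generation used for the figure.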

As a more complex example, the function call_proc, shown in Figure 3.32,
illustrates many aspects of the x86-64 stack discipline. Despite the length of this
example, it is worth studying carefully. It shows a function that must allocate
storage on the stack for local variables, as well as to pass values to the 8-argument
function proc (Figure 3.29). The function creates a stack frame, diagrammed in
Figure 3.33.

Looking at the assembly code for call_proc (Figure 3.32(b)), we can see
that a large portion of the code (lines 2-15) involves preparing to call function
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(a) Code for swap_add and calling function 


long swap_add(long *xp, long *yp)
{
    long x = *xp;
    long y = *yp;
    *xp = y;
    *yp = x;
    return x + y;
}

long caller()
{
    long arg1 = 534;
    long arg2 = 1057;
    long sum = swap_add(&arg1, &arg2);
    long diff = arg1 - arg2;
    return sum * diff;
}
(b) Generated assembly code for calling function 


long caller()
1   caller:
2     subq    $16, %rsp         Allocate 16 bytes for stack frame
3     movq    $534, (%rsp)      Store 534 in arg1
4     movq    $1057, 8(%rsp)    Store 1057 in arg2
5     leaq    8(%rsp), %rsi     Compute &arg2 as second argument
6     movq    %rsp, %rdi        Compute &arg1 as first argument
7     call    swap_add          Call swap_add(&arg1, &arg2)
8     movq    (%rsp), %rdx      Get arg1
9     subq    8(%rsp), %rdx     Compute diff = arg1 - arg2
10    imulq   %rdx, %rax        Compute sum * diff
11    addq    $16, %rsp         Deallocate stack frame
12    ret                       Return


Figure 3.31 Example of procedure definition and call. The calling code must allocate
a stack frame due to the presence of address operators. 


proc. This includes setting up the stack frame for the local variables and function
parameters, and loading function arguments into registers. As Figure 3.33
shows, local variables x1-x4 are allocated on the stack and have different sizes.
Expressing their locations as offsets relative to the stack pointer, they occupy bytes
24-31 (x1), 20-23 (x2), 18-19 (x3), and 17 (x4). Pointers to these locations are
generated by leaq instructions (lines 7, 10, 12, and 14). Arguments 7 (with value
4) and 8 (a pointer to the location of x4) are stored on the stack at offsets 0 and 8
relative to the stack pointer.
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(a) C code for calling function 


long call_proc()
{
    long  x1 = 1; int  x2 = 2;
    short x3 = 3; char x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
    return (x1+x2)*(x3-x4);
}
(b) Generated assembly code 


long call_proc()
1   call_proc:
    Set up arguments to proc
2     subq    $32, %rsp          Allocate 32-byte stack frame
3     movq    $1, 24(%rsp)       Store 1 in &x1
4     movl    $2, 20(%rsp)       Store 2 in &x2
5     movw    $3, 18(%rsp)       Store 3 in &x3
6     movb    $4, 17(%rsp)       Store 4 in &x4
7     leaq    17(%rsp), %rax     Create &x4
8     movq    %rax, 8(%rsp)      Store &x4 as argument 8
9     movl    $4, (%rsp)         Store 4 as argument 7
10    leaq    18(%rsp), %r9      Pass &x3 as argument 6
11    movl    $3, %r8d           Pass 3 as argument 5
12    leaq    20(%rsp), %rcx     Pass &x2 as argument 4
13    movl    $2, %edx           Pass 2 as argument 3
14    leaq    24(%rsp), %rsi     Pass &x1 as argument 2
15    movl    $1, %edi           Pass 1 as argument 1
    Call proc
16    call    proc
    Retrieve changes to memory
17    movslq  20(%rsp), %rdx     Get x2 and convert to long
18    addq    24(%rsp), %rdx     Compute x1+x2
19    movswl  18(%rsp), %eax     Get x3 and convert to int
20    movsbl  17(%rsp), %ecx     Get x4 and convert to int
21    subl    %ecx, %eax         Compute x3-x4
22    cltq                       Convert to long
23    imulq   %rdx, %rax         Compute (x1+x2) * (x3-x4)
24    addq    $32, %rsp          Deallocate stack frame
25    ret                        Return


Figure 3.32 Example of code to call function proc, defined in Figure 3.29. This code
creates a stack frame. 
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Figure 3.33 

Stack frame for function call_proc. The stack frame contains local
variables, as well as two of the arguments to pass to function proc.

    Bytes   Contents
    32-39   Return address
    24-31   x1
    20-23   x2
    18-19   x3
    17      x4
     8-15   Argument 8 = &x4
     0-7    Argument 7         <- Stack pointer %rsp


When procedure proc is called, the program will begin executing the code 
shown in Figure 3.29(b). As shown in Figure 3.30, arguments 7 and 8 are now 
at offsets 8 and 16 relative to the stack pointer, because the return address was 
pushed onto the stack. 

When the program returns to call_proc, the code retrieves the values of the
four local variables (lines 17-20) and performs the final computations. It finishes
by incrementing the stack pointer by 32 to deallocate the stack frame.
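
The value returned by call_proc can be worked out from the C code in Figures 3.29(a) and 3.32(a). The sketch below repeats both functions and adds a hypothetical helper, check_call_proc, that is not part of the original figures. Each call to proc doubles the variable it points to, so x1 through x4 become 2, 4, 6, and 8, and the result is (2+4)*(6-8).

```c
#include <assert.h>
#include <stdio.h>

/* As in Figure 3.29(a). */
void proc(long a1, long *a1p,
          int a2, int *a2p,
          short a3, short *a3p,
          char a4, char *a4p)
{
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    *a4p += a4;
}

/* As in Figure 3.32(a). */
long call_proc(void)
{
    long  x1 = 1; int  x2 = 2;
    short x3 = 3; char x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
    return (x1 + x2) * (x3 - x4);   /* (2+4) * (6-8) */
}

/* Hypothetical helper: confirms the value worked out by hand. */
long check_call_proc(void)
{
    long r = call_proc();
    assert(r == -12);
    printf("call_proc() = %ld\n", r);
    return r;
}
```

The subtraction x3-x4 is carried out on int values after the usual integer promotions, which matches the movswl and movsbl conversions followed by subl and cltq in Figure 3.32(b).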


3.7.5 Local Storage in Registers 


The set of program registers acts as a single resource shared by all of the proce- 
dures. Although only one procedure can be active at a given time, we must make 
sure that when one procedure (the caller) calls another (the callee), the callee does 
not overwrite some register value that the caller planned to use later. For this rea- 
son, x86-64 adopts a uniform set of conventions for register usage that must be 
respected by all procedures, including those in program libraries. 

By convention, registers %rbx, %rbp, and %r12-%r15 are classified as callee-saved
registers. When procedure P calls procedure Q, Q must preserve the values
of these registers, ensuring that they have the same values when Q returns to P as
they did when Q was called. Procedure Q can preserve a register value by either not
changing it at all or by pushing the original value on the stack, altering it, and then
popping the old value from the stack before returning. The pushing of register
values has the effect of creating the portion of the stack frame labeled "Saved
registers" in Figure 3.25. With this convention, the code for P can safely store a
value in a callee-saved register (after saving the previous value on the stack, of
course), call Q, and then use the value in the register without risk of it having been
corrupted.

All other registers, except for the stack pointer %rsp, are classified as caller-saved
registers. This means that they can be modified by any function. The name
"caller saved" can be understood in the context of a procedure P having some local
data in such a register and calling procedure Q. Since Q is free to alter this register,
it is incumbent upon P (the caller) to first save the data before it makes the call.

As an example, consider the function P shown in Figure 3.34(a). It calls Q twice.
During the first call, it must retain the value of x for use later. Similarly, during
the second call, it must retain the value computed for Q(y). In Figure 3.34(b),
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(a) Calling function 


long P(long x, long y)
{
    long u = Q(y);
    long v = Q(x);
    return u + v;
}
(b) Generated assembly code for the calling function 


long P(long x, long y)
x in %rdi, y in %rsi
1   P:
2     pushq   %rbp           Save %rbp
3     pushq   %rbx           Save %rbx
4     subq    $8, %rsp       Align stack frame
5     movq    %rdi, %rbp     Save x
6     movq    %rsi, %rdi     Move y to first argument
7     call    Q              Call Q(y)
8     movq    %rax, %rbx     Save result
9     movq    %rbp, %rdi     Move x to first argument
10    call    Q              Call Q(x)
11    addq    %rbx, %rax     Add saved Q(y) to Q(x)
12    addq    $8, %rsp       Deallocate last part of stack
13    popq    %rbx           Restore %rbx
14    popq    %rbp           Restore %rbp
15    ret


Figure 3.34 Code demonstrating use of callee-saved registers. Value x must be 
preserved during the first call, and value Q(y) must be preserved during the second. 

we can see that the code generated by GCC uses two callee-saved registers: %rbp
to hold x, and %rbx to hold the computed value of Q(y). At the beginning of the
function, it saves the values of these two registers on the stack (lines 2-3). It copies
argument x to %rbp before the first call to Q (line 5). It copies the result of this call
to %rbx before the second call to Q (line 8). At the end of the function (lines 13-14),
it restores the values of the two callee-saved registers by popping them off the
stack. Note how they are popped in the reverse order from how they were pushed,
to account for the last-in, first-out discipline of a stack.
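
To experiment with this pattern, function P needs some definition of Q. The text does not define Q, so the version below is a purely hypothetical stand-in chosen so the result is easy to check by hand. The point of the sketch is only that x must outlive the first call and u must outlive the second, which is what motivates the use of callee-saved registers in Figure 3.34(b).

```c
#include <assert.h>

/* Hypothetical stand-in for Q; the original text leaves Q unspecified. */
long Q(long x)
{
    return 3 * x + 1;
}

/* As in Figure 3.34(a). */
long P(long x, long y)
{
    long u = Q(y);   /* x is still needed after this call returns */
    long v = Q(x);   /* u is still needed after this call returns */
    return u + v;
}

/* Hypothetical helper: with this Q, P(2, 5) = Q(5) + Q(2) = 16 + 7. */
long check_P(void)
{
    long r = P(2, 5);
    assert(r == 23);
    return r;
}
```

If Q is defined in the same file, an optimizing compiler may inline it and remove the calls entirely; placing Q in a separate source file is one way to keep the calls visible in the generated assembly.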




Consider a function P, which generates local values, named a0-a8. It then calls
function Q using these generated values as arguments. GCC produces the following
code for the first part of P:

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long P(long x) 


x in %rdi
1   P:
2     pushq   %r15
3     pushq   %r14
4     pushq   %r13
5     pushq   %r12
6     pushq   %rbp
7     pushq   %rbx
8     subq    $24, %rsp
9     movq    %rdi, %rbx
10    leaq    1(%rdi), %r15
11    leaq    2(%rdi), %r14
12    leaq    3(%rdi), %r13
13    leaq    4(%rdi), %r12
14    leaq    5(%rdi), %rbp
15    leaq    6(%rdi), %rax
16    movq    %rax, (%rsp)
17    leaq    7(%rdi), %rdx
18    movq    %rdx, 8(%rsp)
19    movl    $0, %eax
20    call    Q


A. Identify which local values get stored in callee-saved registers. 
B. Identify which local values get stored on the stack. 


C. Explain why the program could not store all of the local values in callee-saved
registers.


3.7.6 Recursive Procedures 


The conventions we have described for using the registers and the stack allow
x86-64 procedures to call themselves recursively. Each procedure call has its own
private space on the stack, and so the local variables of the multiple outstanding
calls do not interfere with one another. Furthermore, the stack discipline naturally
provides the proper policy for allocating local storage when the procedure is called
and deallocating it before returning.

Figure 3.35 shows both the C code and the generated assembly code for a
recursive factorial function. We can see that the assembly code uses register %rbx
to hold the parameter n, after first saving the existing value on the stack (line 2)
and later restoring the value before returning (line 11). Due to the stack discipline
and the register-saving conventions, we can be assured that when the recursive call
to rfact(n-1) returns (line 9), (1) the result of the call will be held in register
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(a) C code 


long rfact(long n)
{
    long result;
    if (n <= 1)
        result = 1;
    else
        result = n * rfact(n-1);
    return result;
}
(b) Generated assembly code 


long rfact(long n)
1   rfact:
2     pushq   %rbx               Save %rbx
3     movq    %rdi, %rbx         Store n in callee-saved register
4     movl    $1, %eax           Set return value = 1
5     cmpq    $1, %rdi           Compare n:1
6     jle     .L35               If <=, goto done
7     leaq    -1(%rdi), %rdi     Compute n-1
8     call    rfact              Call rfact(n-1)
9     imulq   %rbx, %rax         Multiply result by n
10  .L35:                      done:
11    popq    %rbx               Restore %rbx
12    ret                        Return


Figure 3.35 Code for recursive factorial program. The standard procedure handling 
mechanisms suffice for implementing recursive functions. 


%rax, and (2) the value of argument n will be held in register %rbx. Multiplying these
two values then computes the desired result.

We can see from this example that calling a function recursively proceeds just
like any other function call. Our stack discipline provides a mechanism where
each invocation of a function has its own private storage for state information
(saved values of the return location and callee-saved registers). If need be, it
can also provide storage for local variables. The stack discipline of allocation and
deallocation naturally matches the call-return ordering of functions. This method
of implementing function calls and returns even works for more complex patterns,
including mutual recursion (e.g., when procedure P calls Q, which in turn calls P).
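
A minimal sketch of mutual recursion is shown below. The functions is_even and is_odd are hypothetical examples, not taken from the text; each one calls the other until the argument reaches zero. Every outstanding call has its own return address on the stack, so the chain unwinds in the reverse order of the calls, just as in the single-function case.

```c
#include <assert.h>

int is_odd(unsigned long n);    /* forward declaration for mutual recursion */

/* Hypothetical example: n is even if n is 0 or n-1 is odd. */
int is_even(unsigned long n)
{
    return n == 0 ? 1 : is_odd(n - 1);
}

/* Hypothetical example: n is odd if n is nonzero and n-1 is even. */
int is_odd(unsigned long n)
{
    return n == 0 ? 0 : is_even(n - 1);
}

/* Hypothetical helper exercising both entry points. */
int check_parity(void)
{
    assert(is_even(10) == 1);
    assert(is_odd(10) == 0);
    assert(is_odd(7) == 1);
    return 1;
}
```

Because both recursive calls are in tail position, an optimizing compiler may replace them with jumps; at low optimization levels each call pushes a return address as described above.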


Practice Problem


For a C function having the general structure

long rfun(unsigned long x) {
    if ( ________ )
        return ________ ;
    unsigned long nx = ________ ;
    long rv = rfun(nx);
    return ________ ;
}





GCC generates the following assembly code:

long rfun(unsigned long x)
x in %rdi
1   rfun:
2     pushq   %rbx
3     movq    %rdi, %rbx
4     movl    $0, %eax
5     testq   %rdi, %rdi
6     je      .L2
7     shrq    $2, %rdi
8     call    rfun
9     addq    %rbx, %rax
10  .L2:
11    popq    %rbx
12    ret


A. What value does rfun store in the callee-saved register %rbx?

B. Fill in the missing expressions in the C code shown above.


3.8 Array Allocation and Access 


Arrays in C are one means of aggregating scalar data into larger data types. C
uses a particularly simple implementation of arrays, and hence the translation
into machine code is fairly straightforward. One unusual feature of C is that we
can generate pointers to elements within arrays and perform arithmetic with these
pointers. These are translated into address computations in machine code.

Optimizing compilers are particularly good at simplifying the address computations
used by array indexing. This can make the correspondence between the C
code and its translation into machine code somewhat difficult to decipher.


3.8.1 Basic Principles


For data type T and integer constant N, consider a declaration of the form

    T A[N];

Let us denote the starting location as x_A. The declaration has two effects. First,
it allocates a contiguous region of L · N bytes in memory, where L is the size (in
bytes) of data type T. Second, it introduces an identifier A that can be used as
a pointer to the beginning of the array. The value of this pointer will be x_A. The
array elements can be accessed using an integer index ranging between 0 and N - 1.
Array element i will be stored at address x_A + L · i.

As examples, consider the following declarations: 


char    A[12];
char   *B[8];
int     C[6];
double *D[5];


These declarations will generate arrays with the following parameters:


Array   Element size   Total size   Start address   Element i
A       1              12           x_A             x_A + i
B       8              64           x_B             x_B + 8i
C       4              24           x_C             x_C + 4i
D       8              40           x_D             x_D + 8i


Array A consists of 12 single-byte (char) elements. Array C consists of 6 integers,
each requiring 4 bytes. B and D are both arrays of pointers, and hence the array
elements are 8 bytes each.
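
The sizes in this table can be confirmed with sizeof, and the element address formula can be checked by measuring byte offsets. The sketch below assumes a typical x86-64 data model (4-byte int, 8-byte pointers); the helper names offset_in_C and check_array_sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

char    A[12];
char   *B[8];
int     C[6];
double *D[5];

/* Byte offset of C[i] from the start of array C; expected to equal 4i
 * when int is 4 bytes, matching the "Element i" column. */
size_t offset_in_C(size_t i)
{
    return (size_t)((char *)&C[i] - (char *)C);
}

/* Hypothetical helper: checks the "Total size" column.
 * Assumes 4-byte int and 8-byte pointers, as on x86-64. */
int check_array_sizes(void)
{
    assert(sizeof(A) == 12);
    assert(sizeof(B) == 64);
    assert(sizeof(C) == 24);
    assert(sizeof(D) == 40);
    assert(offset_in_C(5) == 5 * sizeof(int));
    return 1;
}
```

On a platform with different type sizes the totals would change, but the relationship total size = L · N and offset = L · i would still hold.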

The memory referencing instructions of x86-64 are designed to simplify array
access. For example, suppose E is an array of values of type int and we wish to
evaluate E[i], where the address of E is stored in register %rdx and i is stored in
register %rcx. Then the instruction

    movl (%rdx,%rcx,4),%eax

will perform the address computation x_E + 4i, read that memory location, and
copy the result to register %eax. The allowed scaling factors of 1, 2, 4, and 8 cover
the sizes of the common primitive data types.





short    S[7];
short   *T[3];
short  **U[6];
int      V[8];
double  *W[4];


Fill in the following table describing the element size, the total size, and the
address of element i for each of these arrays.


Array   Element size   Total size   Start address   Element i
S       ________       ________     x_S             ________
T       ________       ________     x_T             ________
U       ________       ________     x_U             ________
V       ________       ________     x_V             ________
W       ________       ________     x_W             ________


3.8.2 Pointer Arithmetic 


C allows arithmetic on pointers, where the computed value is scaled according to
the size of the data type referenced by the pointer. That is, if p is a pointer to data
of type T, and the value of p is x_p, then the expression p+i has value x_p + L · i,
where L is the size of data type T.

The unary operators '&' and '*' allow the generation and dereferencing of
pointers. That is, for an expression Expr denoting some object, &Expr is a pointer
giving the address of the object. For an expression AExpr denoting an address,
*AExpr gives the value at that address. The expressions Expr and *&Expr are
therefore equivalent. The array subscripting operation can be applied to both
arrays and pointers. The array reference A[i] is identical to the expression *(A+i).
It computes the address of the ith array element and then accesses this memory
location.

Expanding on our earlier example, suppose the starting address of integer
array E and integer index i are stored in registers %rdx and %rcx, respectively.
The following are some expressions involving E. We also show an assembly-code
implementation of each expression, with the result being stored in either register
%eax (for data) or register %rax (for pointers).


Expression   Type    Value               Assembly code
E            int *   x_E                 movq %rdx,%rax
E[0]         int     M[x_E]              movl (%rdx),%eax
E[i]         int     M[x_E + 4i]         movl (%rdx,%rcx,4),%eax
&E[2]        int *   x_E + 8             leaq 8(%rdx),%rax
E+i-1        int *   x_E + 4i - 4        leaq -4(%rdx,%rcx,4),%rax
*(E+i-3)     int     M[x_E + 4i - 12]    movl -12(%rdx,%rcx,4),%eax
&E[i]-E      long    i                   movq %rcx,%rax


In these examples, we see that operations that return array values have type
int, and hence involve 4-byte operations (e.g., movl) and registers (e.g., %eax).
Those that return pointers have type int *, and hence involve 8-byte operations
(e.g., leaq) and registers (e.g., %rax). The final example shows that one can
compute the difference of two pointers within the same data structure, with the
result being data having type long and value equal to the difference of the two
addresses divided by the size of the data type.
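
Several rows of the table can be checked at the C level. The sketch below uses hypothetical helper names and a small test array; it confirms that &E[2] lies 2 * sizeof(int) bytes past E (8 bytes when int is 4 bytes), that *(E+i-3) and E[i-3] name the same element, and that the pointer difference &E[i]-E is scaled back down to the index i.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the identities hold for array E and
 * index i. Requires 3 <= i and i within the bounds of E. */
int check_pointer_identities(int *E, long i)
{
    int ok = 1;
    ok &= (&E[2] == E + 2);
    /* Byte distance is scaled by the element size: x_E + 8 for 4-byte int. */
    ok &= ((char *)&E[2] - (char *)E == 2 * (long)sizeof(int));
    /* Subscripting and pointer arithmetic reach the same element. */
    ok &= (*(E + i - 3) == E[i - 3]);
    /* Pointer difference is measured in elements, not bytes. */
    ok &= (&E[i] - E == i);
    return ok;
}

/* Hypothetical helper: runs the checks on a small array with i = 5. */
int demo_pointer_identities(void)
{
    int E[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    return check_pointer_identities(E, 5);
}
```

The byte-level comparison goes through char * because arithmetic on char pointers is not scaled, which makes the scaling applied to int * arithmetic visible.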
Suppose x_S, the address of short integer array S, and long integer index i are stored
in registers %rdx and %rcx, respectively. For each of the following expressions, give
its type, a formula for its value, and an assembly-code implementation. The result
should be stored in register %rax if it is a pointer and register element %ax if it has
data type short.


Expression   Type        Value       Assembly code
S+1          ________    ________    ________________
S[3]         ________    ________    ________________
&S[i]        ________    ________    ________________
S[4*i+1]     ________    ________    ________________
S+i-5        ________    ________    ________________


3.8.3 Nested Arrays


The general principles of array allocation and referencing hold even when we 
create arrays of arrays. For example, the declaration 


    int A[5][3];

is equivalent to the declaration

    typedef int row3_t[3];
    row3_t A[5];


Data type row3_t is defined to be an array of three integers. Array A contains five
such elements, each requiring 12 bytes to store the three integers. The total array
size is then 4 · 5 · 3 = 60 bytes.

Array A can also be viewed as a two-dimensional array with five rows and
three columns, referenced as A[0][0] through A[4][2]. The array elements are
ordered in memory in row-major order, meaning all elements of row 0, which
can be written A[0], followed by all elements of row 1 (A[1]), and so on. This is
illustrated in Figure 3.36.

This ordering is a consequence of our nested declaration. Viewing A as an
array of five elements, each of which is an array of three int's, we first have A[0],
followed by A[1], and so on.

Figure 3.36 Elements of array in row-major order.

Row     Element     Address
A[0]    A[0][0]     x_A
        A[0][1]     x_A + 4
        A[0][2]     x_A + 8
A[1]    A[1][0]     x_A + 12
        A[1][1]     x_A + 16
        A[1][2]     x_A + 20
A[2]    A[2][0]     x_A + 24
        A[2][1]     x_A + 28
        A[2][2]     x_A + 32
A[3]    A[3][0]     x_A + 36
        A[3][1]     x_A + 40
        A[3][2]     x_A + 44
A[4]    A[4][0]     x_A + 48
        A[4][1]     x_A + 52
        A[4][2]     x_A + 56


To access elements of multidimensional arrays, the compiler generates code to
compute the offset of the desired element and then uses one of the MOV instructions
with the start of the array as the base address and the (possibly scaled) offset as
an index. In general, for an array declared as

    T D[R][C];

array element D[i][j] is at memory address

    &D[i][j] = x_D + L(C · i + j)                                        (3.1)

where L is the size of data type T in bytes. As an example, consider the 5 x 3 integer
array A defined earlier. Suppose x_A, i, and j are in registers %rdi, %rsi, and %rdx,
respectively. Then array element A[i][j] can be copied to register %eax by the
following code:


A in %rdi, i in %rsi, and j in %rdx
1   leaq  (%rsi,%rsi,2), %rax     Compute 3i
2   leaq  (%rdi,%rax,4), %rax     Compute x_A + 12i
3   movl  (%rax,%rdx,4), %eax     Read from M[x_A + 12i + 4j]


As can be seen, this code computes the element's address as x_A + 12i + 4j = x_A +
4(3i + j) using the scaling and addition capabilities of x86-64 address arithmetic.
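
Equation 3.1 can be checked against what the C compiler actually does by comparing a measured byte offset with the predicted one. In the sketch below, NROWS, NCOLS, and the three helper functions are hypothetical names; the array dimensions match the 5 x 3 example above.

```c
#include <assert.h>
#include <stddef.h>

#define NROWS 5
#define NCOLS 3

/* Byte offset of A[i][j] from the start of A, measured directly. */
size_t measured_offset(int A[NROWS][NCOLS], long i, long j)
{
    return (size_t)((char *)&A[i][j] - (char *)A);
}

/* Byte offset predicted by Equation 3.1: L(C*i + j), with L = sizeof(int). */
size_t predicted_offset(long i, long j)
{
    return sizeof(int) * (size_t)(NCOLS * i + j);
}

/* Hypothetical helper: returns 1 if the two agree for every element. */
int offsets_agree(void)
{
    int A[NROWS][NCOLS];
    for (long i = 0; i < NROWS; i++)
        for (long j = 0; j < NCOLS; j++)
            if (measured_offset(A, i, j) != predicted_offset(i, j))
                return 0;
    return 1;
}
```

With a 4-byte int, predicted_offset(4, 2) evaluates to 56, which is the offset of the last element, A[4][2], in the row-major layout.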


Consider the following source code, where M and N are constants declared with
#define:

long P[M][N];
long Q[N][M];

long sum_element(long i, long j) {
    return P[i][j] + Q[j][i];
}

In compiling this program, GCC generates the following assembly code:

long sum_element(long i, long j)
i in %rdi, j in %rsi
1   sum_element:
2     leaq    0(,%rdi,8), %rdx
3     subq    %rdi, %rdx
4     addq    %rsi, %rdx
5     leaq    (%rsi,%rsi,4), %rax
6     addq    %rax, %rdi
7     movq    Q(,%rdi,8), %rax
8     addq    P(,%rdx,8), %rax
9     ret



Use your reverse engineering skills to determine the values of M and N based 
on this assembly code. 


3.8.4 Fixed-Size Arrays 


The C compiler is able to make many optimizations for code operating on
multidimensional arrays of fixed size. Here we demonstrate some of the optimizations
made by GCC when the optimization level is set with the flag -O1. Suppose we
declare data type fix_matrix to be 16 x 16 arrays of integers as follows:


#define N 16
typedef int fix_matrix[N][N];


(This example illustrates a good coding practice. Whenever a program uses some
constant as an array dimension or buffer size, it is best to associate a name with
it via a #define declaration, and then use this name consistently, rather than
the numeric value. That way, if an occasion ever arises to change the value, it
can be done by simply modifying the #define declaration.) The code in Figure
3.37(a) computes element i, k of the product of arrays A and B, that is, the
inner product of row i from A and column k from B. This product is given by
the formula $\sum_{0 \le j < N} a_{i,j} \cdot b_{j,k}$. GCC generates code that we
then recoded into C, shown as function fix_prod_ele_opt in Figure 3.37(b). This
code contains a number of clever optimizations. It removes the integer index j and
converts all array references to pointer dereferences. This involves (1) generating
a pointer, which we have named Aptr, that points to successive elements in row i of A,
(2) generating a pointer, which we have named Bptr, that points to successive
elements in column k of B, and (3) generating a pointer, which we have named
Bend, that equals the value Bptr will have when it is time to terminate the loop.
The initial value for Aptr is the address of the first element of row i of A, given
by the C expression &A[i][0]. The initial value for Bptr is the address of the first
element of column k of B, given by the C expression &B[0][k]. The value for Bend
is the address of what would be the (N + 1)st element in column k of B, given by the
C expression &B[N][k].
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(a) Original C code 

/* Compute i,k of fixed matrix product */ 

int fix prod ele (fix matrix A, fix matrix B, long i, long k) { 
long j; 
int result = 0; 


for (j = 0; j < N; j++) 
result += A[i][j] * B[jlfkl; 


return result; 


} 
(b) Optimized C code

1   /* Compute i,k of fixed matrix product */
2   int fix_prod_ele_opt(fix_matrix A, fix_matrix B, long i, long k) {
3       int *Aptr = &A[i][0];    /* Points to elements in row i of A    */
4       int *Bptr = &B[0][k];    /* Points to elements in column k of B */
5       int *Bend = &B[N][k];    /* Marks stopping point for Bptr       */
6       int result = 0;
7       do {                          /* No need for initial test */
8           result += *Aptr * *Bptr;  /* Add next product to sum  */
9           Aptr++;                   /* Move Aptr to next column */
10          Bptr += N;                /* Move Bptr to next row    */
11      } while (Bptr != Bend);       /* Test for stopping point  */
12      return result;
13  }


Figure 3.37 Original and optimized code to compute element i, k of matrix product
for fixed-length arrays. The compiler performs these optimizations automatically.


The following is the actual assembly code generated by GCC for function
fix_prod_ele. We see that four registers are used as follows: %eax holds result, %rdi
holds Aptr, %rcx holds Bptr, and %rsi holds Bend.


int fix_prod_ele_opt(fix_matrix A, fix_matrix B, long i, long k)
A in %rdi, B in %rsi, i in %rdx, k in %rcx
1   fix_prod_ele:
2     salq  $6, %rdx              Compute 64 * i
3     addq  %rdx, %rdi            Compute Aptr = x_A + 64i = &A[i][0]
4     leaq  (%rsi,%rcx,4), %rcx   Compute Bptr = x_B + 4k = &B[0][k]
5     leaq  1024(%rcx), %rsi      Compute Bend = x_B + 4k + 1024 = &B[N][k]
6     movl  $0, %eax              Set result = 0
7   .L7:                          loop:
8     movl  (%rdi), %edx          Read *Aptr
9     imull (%rcx), %edx          Multiply by *Bptr
10    addl  %edx, %eax            Add to result
11    addq  $4, %rdi              Increment Aptr++
12    addq  $64, %rcx             Increment Bptr += N
13    cmpq  %rsi, %rcx            Compare Bptr:Bend
14    jne   .L7                   If !=, goto loop
15    rep; ret                    Return


Use Equation 3.1 to explain how the computations of the initial values for Aptr,
Bptr, and Bend in the C code of Figure 3.37(b) (lines 3-5) correctly describe their
computations in the assembly code generated for fix_prod_ele (lines 3-5).


The following C code sets the diagonal elements of one of our fixed-size arrays to
val:

/* Set all diagonal elements to val */
void fix_set_diag(fix_matrix A, int val) {
    long i;
    for (i = 0; i < N; i++)
        A[i][i] = val;
}


When compiled with optimization level -O1, GCC generates the following
assembly code:


void fix_set_diag(fix_matrix A, int val)
A in %rdi, val in %rsi
fix_set_diag:
  movl $0, %eax
.L13:
  movl %esi, (%rdi,%rax)
  addq $68, %rax
  cmpq $1088, %rax
  jne .L13
  rep; ret


Create a C code program fix_set_diag_opt that uses optimizations similar
to those in the assembly code, in the same style as the code in Figure 3.37(b). Use
expressions involving the parameter N rather than integer constants, so that your
code will work correctly if N is redefined.


3.8.5 Variable-Size Arrays 


Historically, C only supported multidimensional arrays where the sizes (with the
possible exception of the first dimension) could be determined at compile time.
Programmers requiring variable-size arrays had to allocate storage for these arrays
using functions such as malloc or calloc, and they had to explicitly encode the
mapping of multidimensional arrays into single-dimension ones via row-major
indexing, as expressed in Equation 3.1. ISO C99 introduced the capability of having
array dimension expressions that are computed as the array is being allocated.
In the C version of variable-size arrays, we can declare an array


int A[expr1][expr2]


either as a local variable or as an argument to a function, and then the dimensions
of the array are determined by evaluating the expressions expr1 and expr2 at the
time the declaration is encountered. So, for example, we can write a function to
access element i, j of an n x n array as follows:


int var_ele(long n, int A[n][n], long i, long j) {
    return A[i][j];
}


The parameter n must precede the parameter A[n][n], so that the function can
compute the array dimensions as the parameter is encountered.
GCC generates code for this referencing function as


int var_ele(long n, int A[n][n], long i, long j)
n in %rdi, A in %rsi, i in %rdx, j in %rcx
var_ele:
  imulq %rdx, %rdi            Compute n · i
  leaq  (%rsi,%rdi,4), %rax   Compute x_A + 4(n · i)
  movl  (%rax,%rcx,4), %eax   Read from M[x_A + 4(n · i) + 4j]
  ret


As the annotations show, this code computes the address of element i, j as x_A +
4(n · i) + 4j = x_A + 4(n · i + j). The address computation is similar to that of the
fixed-size array (Section 3.8.3), except that (1) the register usage changes due to
added parameter n, and (2) a multiply instruction (imulq) is used to compute n · i,
rather than an leaq instruction to compute 3i. We see therefore that referencing
variable-size arrays requires only a slight generalization over fixed-size ones. The
dynamic version must use a multiplication instruction to scale i by n, rather than
a series of shifts and adds. In some processors, this multiplication can incur a
significant performance penalty, but it is unavoidable in this case.
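To make the relationship between the two styles concrete, the sketch below pairs the C99 function var_ele from the text with the older approach of a flat, heap-allocated block indexed by hand using Equation 3.1. The names flat_ele and make_matrix, and the fill pattern 10i + j, are our own illustrative choices. The code assumes a compiler with variable-length array support, such as GCC.

```c
#include <assert.h>
#include <stdlib.h>

/* C99 variable-size array parameter, as in the text. */
int var_ele(long n, int A[n][n], long i, long j) {
    return A[i][j];
}

/* Older style: flat storage with explicit row-major indexing
   (Equation 3.1 with n columns per row). */
int flat_ele(long n, const int *A, long i, long j) {
    assert(i >= 0 && i < n && j >= 0 && j < n);
    return A[n * i + j];
}

/* Allocate an n x n matrix as one flat block and set element (i, j)
   to 10*i + j. Returns NULL if allocation fails; the caller frees it. */
int *make_matrix(long n) {
    int *A = malloc((size_t)(n * n) * sizeof *A);
    if (A == NULL)
        return NULL;
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            A[n * i + j] = (int)(10 * i + j);
    return A;
}
```

Passing the same block to var_ele through a cast such as (int (*)[n]) M should return the same element as flat_ele, since both compute x_A + 4(n · i + j).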

When variable-size arrays are referenced within a loop, the compiler can often 
optimize the index computations by exploiting the regularity of the access patterns. 
For example, Figure 3.38(a) shows C code to compute element i, k of the product 
of two n x n arrays A and B. GCC generates assembly code, which we have recast
into C (Figure 3.38(b)). This code follows a different style from the optimized
code for the fixed-size array (Figure 3.37), but that is more an artifact of the choices
made by the compiler, rather than a fundamental requirement for the two different
functions. The code of Figure 3.38(b) retains loop variable j, both to detect when
the loop has terminated and to index into an array consisting of the elements of
row i of A.


(a) Original C code

/* Compute i,k of variable matrix product */
int var_prod_ele(long n, int A[n][n], int B[n][n], long i, long k) {
    long j;
    int result = 0;

    for (j = 0; j < n; j++)
        result += A[i][j] * B[j][k];

    return result;
}

(b) Optimized C code

/* Compute i,k of variable matrix product */
int var_prod_ele_opt(long n, int A[n][n], int B[n][n], long i, long k) {
    int *Arow = A[i];
    int *Bptr = &B[0][k];
    int result = 0;
    long j;
    for (j = 0; j < n; j++) {
        result += Arow[j] * *Bptr;
        Bptr += n;
    }
    return result;
}


Figure 3.38 Original and optimized code to compute element i, k of matrix product for
variable-size arrays. The compiler performs these optimizations automatically.


The following is the assembly code for the loop of var_prod_ele:

Registers: n in %rdi, Arow in %rsi, Bptr in %rcx
           4n in %r9, result in %eax, j in %edx
.L24:                             loop:
  movl  (%rsi,%rdx,4), %r8d       Read Arow[j]
  imull (%rcx), %r8d              Multiply by *Bptr
  addl  %r8d, %eax                Add to result
  addq  $1, %rdx                  j++
  addq  %r9, %rcx                 Bptr += n
  cmpq  %rdi, %rdx                Compare j:n
  jne   .L24                      If !=, goto loop


We see that the program makes use of both a scaled value 4n (register %r9) for
incrementing Bptr as well as the value of n (register %rdi) to check the loop
bounds. The need for two values does not show up in the C code, due to the scaling
of pointer arithmetic.

We have seen that, with optimizations enabled, GCC is able to recognize
patterns that arise when a program steps through the elements of a multidimensional
array. It can then generate code that avoids the multiplication that would result
from a direct application of Equation 3.1. Whether it generates the pointer-based
code of Figure 3.37(b) or the array-based code of Figure 3.38(b), these
optimizations will significantly improve program performance.


3.9 Heterogeneous Data Structures 


C provides two mechanisms for creating data types by combining objects of
different types: structures, declared using the keyword struct, aggregate multiple
objects into a single unit; unions, declared using the keyword union, allow an
object to be referenced using several different types.


3.9.1 Structures 


The C struct declaration creates a data type that groups objects of possibly 
different types into a single object. The different components of a structure are 
referenced by names. The implementation of structures is similar to that of arrays 
in that all of the components of a structure are stored in a contiguous region of
memory and a pointer to a structure is the address of its first byte. The compiler 
maintains information about each structure type indicating the byte offset of 
each field. It generates references to structure elements using these offsets as 
displacements in memory referencing instructions. 
As an example, consider the following structure declaration: 


struct rec {
    int i;
    int j;
    int a[2];
    int *p;
};


This structure contains four fields: two 4-byte values of type int, a two-element
array of type int, and an 8-byte integer pointer, giving a total of 24 bytes:


Offset     0      4      8      12     16         24
Contents   | i    | j    | a[0] | a[1] | p        |


Observe that array a is embedded within the structure. The numbers along
the top of the diagram give the byte offsets of the fields from the beginning of the
structure.

New to C? Representing an object as a struct

The struct data type constructor is the closest thing C provides to the objects of C++ and Java. It allows
the programmer to keep information about some entity in a single data structure and to reference that
information with names.
For example, a graphics program might represent a rectangle as a structure:

struct rect {
    long llx;               /* X coordinate of lower-left corner */
    long lly;               /* Y coordinate of lower-left corner */
    unsigned long width;    /* Width (in pixels) */
    unsigned long height;   /* Height (in pixels) */
    unsigned color;         /* Coding of color */
};

We can declare a variable r of type struct rect and set its field values as follows:

struct rect r;
r.llx = r.lly = 0;
r.color = 0xFF00FF;
r.width = 10;
r.height = 20;

where the expression r.llx selects field llx of structure r.

Alternatively, we can both declare the variable and initialize its fields with a single statement,
listing the values in the order the fields are declared:

struct rect r = { 0, 0, 10, 20, 0xFF00FF };

It is common to pass pointers to structures from one place to another rather than copying them.
For example, the following function computes the area of a rectangle, where a pointer to the rectangle
struct is passed to the function:

long area(struct rect *rp) {
    return (*rp).width * (*rp).height;
}

The expression (*rp).width dereferences the pointer and selects the width field of the resulting
structure. Parentheses are required, because the compiler would interpret the expression *rp.width as
*(rp.width), which is not valid. This combination of dereferencing and field selection is so common
that C provides an alternative notation using ->. That is, rp->width is equivalent to the expression
(*rp).width. For example, we can write a function that rotates a rectangle counterclockwise by 90
degrees as

void rotate_left(struct rect *rp) {
    /* Exchange width and height */
    long t = rp->height;
    rp->height = rp->width;
    rp->width = t;
    /* Shift to new lower-left corner */
    rp->llx -= t;
}

The objects of C++ and Java are more elaborate than structures in C, in that they also associate
a set of methods with an object that can be invoked to perform computation. In C, we would simply
write these as ordinary functions, such as the functions area and rotate_left shown previously.


To access the fields of a structure, the compiler generates code that adds the
appropriate offset to the address of the structure. For example, suppose variable r
of type struct rec * is in register %rdi. Then the following code copies element
r->i to element r->j:


Registers: r in %rdi
  movl (%rdi), %eax      Get r->i
  movl %eax, 4(%rdi)     Store in r->j


Since the offset of field i is 0, the address of this field is simply the value of r. To
store into field j, the code adds offset 4 to the address of r.

To generate a pointer to an object within a structure, we can simply add the
field's offset to the structure address. For example, we can generate the pointer
&(r->a[1]) by adding offset 8 + 4 · 1 = 12. For pointer r in register %rdi and long
integer variable i in register %rsi, we can generate the pointer value &(r->a[i])
with the single instruction


Registers: r in %rdi, i in %rsi
  leaq 8(%rdi,%rsi,4), %rax      Set %rax to &r->a[i]


As a final example, the following code implements the statement


r->p = &r->a[r->i + r->j];

starting with r in register %rdi:


Registers: r in %rdi
  movl 4(%rdi), %eax             Get r->j
  addl (%rdi), %eax              Add r->i
  cltq                           Extend to 8 bytes
  leaq 8(%rdi,%rax,4), %rax      Compute &r->a[r->i + r->j]
  movq %rax, 16(%rdi)            Store in r->p


As these examples show, the selection of the different fields of a structure is
handled completely at compile time. The machine code contains no information
about the field declarations or the names of the fields.
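The offsets used in these examples can be confirmed with the standard offsetof macro. The sketch below assumes the x86-64 layout described above (4-byte int, 8-byte pointers). The helper names set_p and manual_addr are ours. manual_addr repeats by hand the base + 8 + 4(i + j) computation that the leaq instruction performs.

```c
#include <assert.h>
#include <stddef.h>

struct rec {
    int i;
    int j;
    int a[2];
    int *p;
};

/* The C statement from the text: r->p = &r->a[r->i + r->j]; */
void set_p(struct rec *r) {
    int idx = r->i + r->j;
    assert(idx >= 0 && idx < 2);   /* stay inside a[2] */
    r->p = &r->a[idx];
}

/* The same address computed manually from the field offset of a,
   mirroring leaq 8(%rdi,%rax,4), %rax. */
int *manual_addr(struct rec *r) {
    size_t idx = (size_t)(r->i + r->j);
    return (int *)((char *)r + offsetof(struct rec, a) + sizeof(int) * idx);
}
```

After calling set_p on a structure with i = 0 and j = 1, r->p should equal both &r->a[1] and the value returned by manual_addr.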
































Consider the following structure declaration:

struct prob {
    int *p;
    struct {
        int x;
        int y;
    } s;
    struct prob *next;
};


This declaration illustrates that one structure can be embedded within another,
just as arrays can be embedded within structures, and arrays can be embedded
within arrays.

The following procedure (with some expressions omitted) operates on this
structure:

void sp_init(struct prob *sp) {
    sp->s.x  = ________;
    sp->p    = ________;
    sp->next = ________;
}


A. What are the offsets (in bytes) of the following fields?

p:     ________
s.x:   ________
s.y:   ________
next:  ________

B. How many total bytes does the structure require?
C. The compiler generates the following assembly code for sp_init:


void sp_init(struct prob *sp)
sp in %rdi
1   sp_init:
2     movl 12(%rdi), %eax
3     movl %eax, 8(%rdi)
4     leaq 8(%rdi), %rax
5     movq %rax, (%rdi)
6     movq %rdi, 16(%rdi)
7     ret


On the basis of this information, fill in the missing expressions in the code
for sp_init.
The following code shows the declaration of a structure of type ELE and the
prototype for a function fun:


struct ELE {
    long v;
    struct ELE *p;
};


long fun(struct ELE *ptr);


When the code for fun is compiled, GCC generates the following assembly
code:


long fun(struct ELE *ptr)
ptr in %rdi
1   fun:
2     movl  $0, %eax
3     jmp   .L2
4   .L3:
5     addq  (%rdi), %rax
6     movq  8(%rdi), %rdi
7   .L2:
8     testq %rdi, %rdi
9     jne   .L3
10    rep; ret


A. Use your reverse engineering skills to write C code for fun.


B. Describe the data structure that this structure implements and the operation
performed by fun.





3.9.2 Unions 


Unions provide a way to circumvent the type system of C, allowing a single object 
to be referenced according to multiple types. The syntax of a union declaration is 
identical to that for structures, but its semantics are very different. Rather than 
having the different fields reference different blocks of memory, they all reference 
the same block. 

Consider the following declarations: 


struct S3 {
    char c;
    int i[2];
    double v;
};

union U3 {
    char c;
    int i[2];
    double v;
};


When compiled on an x86-64 Linux machine, the offsets of the fields, as well as
the total size of data types S3 and U3, are as shown in the following table:


Type    c    i    v     Size
S3      0    4    16    24
U3      0    0    0     8


(We will see shortly why i has offset 4 in S3 rather than 1, and why v has offset 16,
rather than 9 or 12.) For pointer p of type union U3 *, references p->c, p->i[0],
and p->v would all reference the beginning of the data structure. Observe also
that the overall size of a union equals the maximum size of any of its fields.
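These offsets and sizes can be reproduced with offsetof and sizeof. The following is a small sketch, assuming the x86-64 Linux conventions stated above. The function names check_s3_u3_layout and u3_max_field are ours, and the assertions would need adjusting on a platform with different type sizes or alignment rules.

```c
#include <assert.h>
#include <stddef.h>

struct S3 { char c; int i[2]; double v; };
union  U3 { char c; int i[2]; double v; };

/* Size of the largest field of U3, for comparison with sizeof(union U3). */
size_t u3_max_field(void) {
    size_t m = sizeof(char);
    if (sizeof(int[2]) > m) m = sizeof(int[2]);
    if (sizeof(double) > m) m = sizeof(double);
    return m;
}

/* Abort if the layout differs from the x86-64 Linux values in the text. */
void check_s3_u3_layout(void) {
    assert(offsetof(struct S3, c) == 0);
    assert(offsetof(struct S3, i) == 4);
    assert(offsetof(struct S3, v) == 16);
    assert(sizeof(struct S3) == 24);
    /* Every union field starts at offset 0. */
    assert(offsetof(union U3, c) == 0);
    assert(offsetof(union U3, i) == 0);
    assert(offsetof(union U3, v) == 0);
    assert(sizeof(union U3) == u3_max_field());
}
```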

Unions can be useful in several contexts. However, they can also lead to nasty
bugs, since they bypass the safety provided by the C type system. One application
is when we know in advance that the use of two different fields in a data structure
will be mutually exclusive. Then, declaring these two fields as part of a union rather
than a structure will reduce the total space allocated.

For example, suppose we want to implement a binary tree data structure
where each leaf node has two double data values and each internal node has
pointers to two children but no data. If we declare this as


struct node_s {
    struct node_s *left;
    struct node_s *right;
    double data[2];
};


then every node requires 32 bytes, with half the bytes wasted for each type of node.
On the other hand, if we declare a node as


union node_u {
    struct {
        union node_u *left;
        union node_u *right;
    } internal;
    double data[2];
};


then every node will require just 16 bytes. If n is a pointer to a node of type
union node_u *, we would reference the data of a leaf node as n->data[0]
and n->data[1], and the children of an internal node as n->internal.left and
n->internal.right.
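The space saving can be observed directly with sizeof. The sketch below repeats the two declarations and adds a small helper, leaf_sum, which is our own illustration of accessing a leaf through the union. It assumes 8-byte pointers and 8-byte double values, as on x86-64.

```c
#include <assert.h>
#include <stddef.h>

struct node_s {
    struct node_s *left;
    struct node_s *right;
    double data[2];
};

union node_u {
    struct {
        union node_u *left;
        union node_u *right;
    } internal;
    double data[2];
};

/* Sum the two data values of a leaf. The caller must already know that
   n is a leaf, because the union itself records no such information. */
double leaf_sum(const union node_u *n) {
    assert(n != NULL);
    return n->data[0] + n->data[1];
}
```

On such a machine sizeof(struct node_s) should be 32 and sizeof(union node_u) should be 16.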





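With this encoding, however, there is no way to determine whether a given node is
a leaf or an internal node. A common method is to introduce an enumerated type
defining the different possible choices for the union, and then create a
structure containing a tag field and the union:


typedef enum { N_LEAF, N_INTERNAL } nodetype_t;


struct node_t {
    nodetype_t type;
    union {
        struct {
            struct node_t *left;
            struct node_t *right;
        } internal;
        double data[2];
    } info;
};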


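This structure requires a total of 24 bytes: 4 for type, and either 8 each for
info.internal.left and info.internal.right or 16 for info.data. As we will
discuss shortly, an additional 4 bytes of padding is required between the field for
type and the union elements, bringing the total structure size to 4 + 4 + 16 = 24.

Unions can also be used to access the bit patterns of different data types. Suppose
we use a simple cast to convert a value d of type double to a value u of type
unsigned long:


unsigned long u = (unsigned long) d;


A cast converts the numeric value, so except when d is 0.0 the bit representation
of u will be very different from that of d. Now consider the following code to
generate a value of type unsigned long from a double:


unsigned long double2bits(double d) {
    union {
        double d;
        unsigned long u;
    } temp;
    temp.d = d;
    return temp.u;
}


In this code, we store the argument in the union using one data type and access it
using another. The result is that u has the same bit representation as d, in the
floating-point format described in Section 3.11. The numeric value of u will bear no
relation to that of d, except for the case when d is 0.0.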

When using unions to combine data types of different sizes, byte-ordering
issues can become important. For example, suppose we write a procedure that
will create an 8-byte double using the bit patterns given by two 4-byte unsigned
values:


double uu2double(unsigned word0, unsigned word1)
{
    union {
        double d;
        unsigned u[2];
    } temp;

    temp.u[0] = word0;
    temp.u[1] = word1;
    return temp.d;
}


On a little-endian machine, such as an x86-64 processor, argument word0 will
become the low-order 4 bytes of d, while word1 will become the high-order 4
bytes. On a big-endian machine, the role of the two arguments will be reversed.
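The effect of byte ordering can be seen by building the value 1.0, whose IEEE 754 bit pattern is 0x3FF0000000000000. The following sketch is a variation on the two functions above that uses the fixed-width types from <stdint.h> so that the field sizes do not depend on the platform. The helper is_little_endian is our own addition, and the code assumes that double is a 64-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>

/* Return the bit pattern of d as a 64-bit integer. */
uint64_t double2bits(double d) {
    union { double d; uint64_t u; } temp;
    assert(sizeof(double) == sizeof(uint64_t));
    temp.d = d;
    return temp.u;
}

/* Build a double from two 32-bit words. Which word supplies the
   high-order bytes depends on the machine's byte ordering. */
double uu2double(uint32_t word0, uint32_t word1) {
    union { double d; uint32_t u[2]; } temp;
    temp.u[0] = word0;
    temp.u[1] = word1;
    return temp.d;
}

/* Return 1 on a little-endian machine and 0 on a big-endian one. */
int is_little_endian(void) {
    union { uint32_t w; unsigned char b[4]; } probe;
    probe.w = 1;
    return probe.b[0] == 1;
}
```

On a little-endian machine the high-order word 0x3FF00000 must be passed as word1 to obtain 1.0. On a big-endian machine it must be passed as word0.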

Practice Problem 3.43
Suppose you are given the job of checking that a C compiler generates the proper
code for structure and union access. You write the following structure declaration:


typedef union {
    struct {
        long u;
        short v;
        char w;
    } t1;
    struct {
        int a[2];
        char *p;
    } t2;
} u_type;


You write a series of functions of the form


void get(u_type *up, type *dest) {
    *dest = expr;
}


with different access expressions expr and with destination data type type set
according to type associated with expr. You then examine the code generated
when compiling the functions to see if they match your expectations.
Suppose in these functions that up and dest are loaded into registers %rdi and
%rsi, respectively. Fill in the following table with data type type and sequences of
one to three instructions to compute the expression and store the result at dest.


expr                   type        Code

up->t1.u               long        movq (%rdi), %rax
                                   movq %rax, (%rsi)

up->t1.v               ________    ________

&up->t1.w              ________    ________

up->t2.a               ________    ________

up->t2.a[up->t1.u]     ________    ________

*up->t2.p              ________    ________

3.9.3 Data Alignment 


Many computer systems place restrictions on the allowable addresses for the 
primitive data types, requiring that the address for some objects must be a multiple 
of some value K (typically 2, 4, or 8). Such alignment restrictions simplify the design
of the hardware forming the interface between the processor and the memory 
system. For example, suppose a processor always fetches 8 bytes from memory 
with an address that must be a multiple of 8. If we can guarantee that any double 
will be aligned to have its address be a multiple of 8, then the value can be read 
or written with a single memory operation. Otherwise, we may need to perform 
two memory accesses, since the object might be split across two 8-byte memory 
blocks. 

The x86-64 hardware will work correctly regardless of the alignment of data. 
However, Intel recommends that data be aligned to improve memory system 
performance. Their alignment rule is based on the principle that any primitive 
object of K bytes must have an address that is a multiple of K. We can see that 
this rule leads to the following alignments: 


K    Types
1    char
2    short
4    int, float
8    long, double, char *

Alignment is enforced by making sure that every data type is organized and
allocated in such a way that every object within the type satisfies its alignment
restrictions. The compiler places directives in the assembly code indicating the
desired alignment for global data. For example, the assembly-code declaration of
the jump table on page 235 contains the following directive on line 2:


.align 8


This ensures that the data following it (in this case the start of the jump table) will
start with an address that is a multiple of 8. Since each table entry is 8 bytes long,
the successive elements will obey the 8-byte alignment restriction.
For code involving structures, the compiler may need to insert gaps in the
field allocation to ensure that each structure element satisfies its alignment
requirement. The structure will then have some required alignment for its starting
address.
For example, consider the structure declaration


struct S1 {
    int i;
    char c;
    int j;
};


Suppose the compiler used the minimal 9-byte allocation, diagrammed as follows:


Offset     0       4    5       9
Contents   | i     | c  | j     |


Then it would be impossible to satisfy the 4-byte alignment requirement for both
fields i (offset 0) and j (offset 5). Instead, the compiler inserts a 3-byte gap (shown
here as "gap") between fields c and j:


Offset     0       4    5       8       12
Contents   | i     | c  | gap   | j     |


As a result, j has offset 8, and the overall structure size is 12 bytes. Furthermore,
the compiler must ensure that any pointer p of type struct S1 * satisfies a 4-byte
alignment. Using our earlier notation, let pointer p have value x_p. Then x_p must
be a multiple of 4. This guarantees that both p->i (address x_p) and p->j (address
x_p + 8) will satisfy their 4-byte alignment requirements.

In addition, the compiler may need to add padding to the end of the structure
so that each element in an array of structures will satisfy its alignment requirement.
For example, consider the following structure declaration:

struct S2 {
    int i;
    int j;
    char c;
};


If we pack this structure into 9 bytes, we can still satisfy the alignment requirements
for fields i and j by making sure that the starting address of the structure satisfies
a 4-byte alignment requirement. Consider, however, the following declaration:


struct S2 d[4];


With the 9-byte allocation, it is not possible to satisfy the alignment requirement
for each element of d, because these elements will have addresses x_d, x_d + 9,
x_d + 18, and x_d + 27. Instead, the compiler allocates 12 bytes for structure S2,
with the final 3 bytes being wasted space:


Offset     0       4       8    9       12
Contents   | i     | j     | c  | gap   |


That way, the elements of d will have addresses x_d, x_d + 12, x_d + 24, and x_d + 36.
As long as x_d is a multiple of 4, all of the alignment restrictions will be satisfied.
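Both kinds of padding, the internal gap in S1 and the trailing gap in S2, show up in the values reported by offsetof and sizeof. The sketch below assumes the x86-64 alignment rules given above. The function names s2_stride and check_alignment_layout are ours, and _Alignof requires a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>

struct S1 { int i; char c; int j; };
struct S2 { int i; int j; char c; };

/* Byte distance between consecutive elements of an array of struct S2. */
size_t s2_stride(void) {
    struct S2 d[4];
    return (size_t)((char *)&d[1] - (char *)&d[0]);
}

/* Abort if the layout differs from the one described in the text. */
void check_alignment_layout(void) {
    assert(offsetof(struct S1, c) == 4);
    assert(offsetof(struct S1, j) == 8);    /* 3-byte gap after c */
    assert(sizeof(struct S1) == 12);
    assert(offsetof(struct S2, c) == 8);
    assert(sizeof(struct S2) == 12);        /* 3 bytes of trailing padding */
    assert(_Alignof(struct S1) == 4);
    assert(_Alignof(struct S2) == 4);
}
```

The stride returned by s2_stride equals sizeof(struct S2), which is why the trailing padding is needed to keep every element of d aligned.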





For each of the following structure declarations, determine the offset of each field,
the total size of the structure, and its alignment requirement for x86-64:


A. struct P1 { int i; char c; int j; char d; };

B. struct P2 { int i; char c; char d; long j; };

C. struct P3 { short w[3]; char c[3]; };

D. struct P4 { short w[5]; char *c[3]; };

E. struct P5 { struct P3 a[2]; struct P2 t; };





Answer the following for the structure declaration


struct {
    char *a;
    short b;
    double c;
    char d;
    float e;
    char f;
    long g;
    int h;
} rec;


A. What are the byte offsets of all the fields in the structure?
B. What is the total size of the structure?


C. Rearrange the fields of the structure to minimize wasted space, and then
show the byte offsets and total size for the rearranged structure.


Aside A case of mandatory alignment

For most x86-64 instructions, keeping data aligned improves efficiency, but it does not affect program
behavior. On the other hand, some models of Intel and AMD processors will not work correctly
with unaligned data for some of the SSE instructions implementing multimedia operations. These
instructions operate on 16-byte blocks of data, and the instructions that transfer data between the SSE
unit and memory require the memory addresses to be multiples of 16. Any attempt to access memory
with an address that does not satisfy this alignment will lead to an exception (see Section 1), with the
default behavior for the program to terminate.

As a result, any compiler and run-time system for an x86-64 processor must ensure that any memory
allocated to hold a data structure that may be read from or stored into an SSE register must satisfy a
16-byte alignment. This requirement has the following two consequences:

* The starting address for any block generated by a memory allocation function (alloca, malloc,
  calloc, or realloc) must be a multiple of 16.

* The stack frame for most functions must be aligned on a 16-byte boundary. (This requirement has
  a number of exceptions.)

More recent versions of x86-64 processors implement the AVX multimedia instructions. In
addition to providing a superset of the SSE instructions, processors supporting AVX also do not have a
mandatory alignment requirement.





3.10 Combining Control and Data in 
Machine-Level Programs 


So far, we have looked separately at how machine-level code implements the
control aspects of a program and how it implements different data structures. In
this section, we look at ways in which data and control interact with each other.
We start by taking a deep look into pointers, one of the most important concepts
in the C programming language, but one for which many programmers only have
a shallow understanding. We review the use of the symbolic debugger GDB for
examining the detailed operation of machine-level programs. Next, we see how
understanding machine-level programs enables us to study buffer overflow, an
important security vulnerability in many real-world systems. Finally, we examine
how machine-level programs implement cases where the amount of stack storage
required by a function can vary from one execution to another.


3.10.1 Understanding Pointers 


Pointers are a central feature of the C programming language. They serve as a 
uniform way to generate references to elements within different data structures.
Pointers are a source of confusion for novice programmers, but the underlying 
concepts are fairly simple. Here we highlight some key principles of pointers and 
their mapping into machine code. 


* Every pointer has an associated type. This type indicates what kind of object 
the pointer points to. Using the following pointer declarations as illustrations 


int *ip; 
char **cpp; 


variable ip is a pointer to an object of type int, while cpp is a pointer to an 
object that itself is a pointer to an object of type char. In general, if the object 
has type T, then the pointer has type T *. The special void * type represents a
generic pointer. For example, the malloc function returns a generic pointer, 
which is converted to a typed pointer via either an explicit cast or by the 
implicit casting of the assignment operation. Pointer types are not part of 
machine code; they are an abstraction provided by C to help programmers 
avoid addressing errors. 


* Every pointer has a value. This value is an address of some object of the
designated type. The special NULL (0) value indicates that the pointer does 
not point anywhere. 


* Pointers are created with the ‘&’ operator. This operator can be applied to any 
C expression that is categorized as an lvalue, meaning an expression that can
appear on the left side of an assignment. Examples include variables and the 
elements of structures, unions, and arrays. We have seen that the machine- 
code realization of the ‘&’ operator often uses the leaq instruction to compute 
the expression value, since this instruction is designed to compute the address 
of a memory reference. 


* Pointers are dereferenced with the ‘*’ operator. The result is a value having the 
type associated with the pointer. Dereferencing is implemented by a memory 
reference, either storing to or retrieving from the specified address. 


* Arrays and pointers are closely related. The name of an array can be referenced
(but not updated) as if it were a pointer variable. Array referencing (e.g.,
a[3]) has the exact same effect as pointer arithmetic and dereferencing (e.g.,
*(a+3)). Both array referencing and pointer arithmetic require scaling the
offsets by the object size. When we write an expression p+i for pointer p with
value p, the resulting address is computed as p + L · i, where L is the size of
the data type associated with p.

* Casting from one type of pointer to another changes its type but not its value.
One effect of casting is to change any scaling of pointer arithmetic. So, for
example, if p is a pointer of type char * having value p, then the expression
(int *) p+7 computes p + 28, while (int *) (p+7) computes p + 7. (Recall
that casting has higher precedence than addition.)

* Pointers can also point to functions. This provides a powerful capability for 
storing and passing references to code, which can be invoked in some other 
part of the program. For example, if we have a function defined by the prototype 

    int fun(int x, int *p); 

then we can declare and assign a pointer fp to this function by the following 
code sequence: 

    int (*fp)(int, int *); 
    fp = fun; 


We can then invoke the function using this pointer: 


int y = 1; 
int result = fp(3, &y); 


The value of a function pointer is the address of the first instruction in the 
machine-code representation of the function. 
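
The principles listed above can be checked with a short sketch. The helper names below are our own, and the body given to fun is an assumption, since the text supplies only its prototype. The offset helpers measure, in bytes, how far pointer arithmetic moves a pointer once the scaling for its type has been applied.

```c
#include <stddef.h>

/* Assumed body for the prototype in the text: returns x plus the value at p. */
int fun(int x, int *p) {
    return x + *p;
}

/* Byte distance covered by (int *) p + 7: the offset is scaled by sizeof(int). */
ptrdiff_t scaled_int_offset(void) {
    int a[8] = {0};
    char *p = (char *) a;              /* p is suitably aligned for int */
    return (char *) ((int *) p + 7) - p;
}

/* Byte distance covered by p + 7 for a char pointer: no scaling beyond 1. */
ptrdiff_t scaled_char_offset(void) {
    char a[8] = {0};
    char *p = a;
    return (p + 7) - p;
}

/* Array referencing and pointer arithmetic name the same element. */
int array_matches_pointer(void) {
    int a[4] = {10, 20, 30, 40};
    return a[3] == *(a + 3);
}

/* Declare a function pointer, assign it, and call through it. */
int call_through_pointer(void) {
    int (*fp)(int, int *) = fun;
    int y = 1;
    return fp(3, &y);
}
```

On a machine where int is 4 bytes, scaled_int_offset yields 28, matching the p + 28 computed in the casting example, while scaled_char_offset yields 7.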

3.10.2 Life in the Real World: Using the GDB Debugger 


The GNU debugger GDB provides a number of useful features to support the 
run-time evaluation and analysis of machine-level programs. With the examples 
and exercises in this book, we attempt to infer the behavior of a program by 
just looking at the code. Using GDB, it becomes possible to study the behavior 
by watching the program in action while having considerable control over its 
execution. 

Figure 3.39 shows examples of some GDB commands that help when working 
with machine-level x86-64 programs. It is very helpful to first run OBJDUMP to get 
a disassembled version of the program. Our examples are based on running GDB 
on the file prog, described and disassembled on page 175. We start GDB with the 
following command line: 


linux> gdb prog 


The general scheme is to set breakpoints near points of interest in the pro- 
gram. These can be set to just after the entry of a function or at a program address. 
When one of the breakpoints is hit during program execution, the program will 
halt and return control to the user. From a breakpoint, we can examine different 
registers and memory locations in various formats. We can also single-step the 
program, running just a few instructions at a time, or we can proceed to the next 
breakpoint. 

As our examples suggest, GDB has an obscure command syntax, but the online 
help information (invoked within GDB with the help command) overcomes this 
shortcoming. Rather than using the command-line interface to GDB, many pro- 
grammers prefer using DDD, an extension to GDB that provides a graphical user 
interface. 


3.10.3 Out-of-Bounds Memory References and Buffer Overflow 


We have seen that C does not perform any bounds checking for array references, 
and that local variables are stored on the stack along with state information such 
as saved register values and return addresses. This combination can lead to serious 
program errors, where the state stored on the stack gets corrupted by a write to an 
out-of-bounds array element. When the program then tries to reload the register 
or execute a ret instruction with this corrupted state, things can go seriously 
wrong. 

A particularly common source of state corruption is known as buffer overflow. 
Typically, some character array is allocated on the stack to hold a string, but the 
size of the string exceeds the space allocated for the array. This is demonstrated 
by the following program example: 


/* Implementation of library function gets() */ 
char *gets(char *s) 
{ 
    int c; 
    char *dest = s; 
    while ((c = getchar()) != '\n' && c != EOF) 
        *dest++ = c; 
    if (c == EOF && dest == s) 
        /* No characters read */ 
        return NULL; 
    *dest++ = '\0'; /* Terminate string */ 
    return s; 
} 

/* Read input line and write it back */ 
void echo() 
{ 
    char buf[8]; /* Way too small! */ 
    gets(buf); 
    puts(buf); 
} 


Command                             Effect 

Starting and stopping 
  quit                              Exit GDB 
  run                               Run your program (give command-line arguments here) 
  kill                              Stop your program 

Breakpoints 
  break multstore                   Set breakpoint at entry to function multstore 
  break *0x400540                   Set breakpoint at address 0x400540 
  delete 1                          Delete breakpoint 1 
  delete                            Delete all breakpoints 

Execution 
  stepi                             Execute one instruction 
  stepi 4                           Execute four instructions 
  nexti                             Like stepi, but proceed through function calls 
  continue                          Resume execution 
  finish                            Run until current function returns 

Examining code 
  disas                             Disassemble current function 
  disas multstore                   Disassemble function multstore 
  disas 0x400544                    Disassemble function around address 0x400544 
  disas 0x400540, 0x40054d          Disassemble code within specified address range 
  print /x $rip                     Print program counter in hex 

Examining data 
  print $rax                        Print contents of %rax in decimal 
  print /x $rax                     Print contents of %rax in hex 
  print /t $rax                     Print contents of %rax in binary 
  print 0x100                       Print decimal representation of 0x100 
  print /x 555                      Print hex representation of 555 
  print /x ($rsp+8)                 Print contents of %rsp plus 8 in hex 
  print *(long *) 0x7fffffffe818    Print long integer at address 0x7fffffffe818 
  print *(long *) ($rsp+8)          Print long integer at address %rsp + 8 
  x/2g 0x7fffffffe818               Examine two (8-byte) words starting at address 0x7fffffffe818 
  x/20b multstore                   Examine first 20 bytes of function multstore 

Useful information 
  info frame                        Information about current stack frame 
  info registers                    Values of all the registers 
  help                              Get information about GDB 

Figure 3.39 Example GDB commands. These examples illustrate some of the ways GDB 
supports debugging of machine-level programs. 


Figure 3.40 Stack organization for echo function. Character array buf is just part of 
the saved state. An out-of-bounds write to buf can corrupt the program state. 
(Diagram labels: stack frame for caller, return address, stack frame for echo.) 


The preceding code shows an implementation of the library function gets 
to demonstrate a serious problem with this function. It reads a line from the 
standard input, stopping when either a terminating newline character or some 
error condition is encountered. It copies this string to the location designated by 
argument s and terminates the string with a null character. We show the use of 
gets in the function echo, which simply reads a line from standard input and echoes 
it back to standard output. 

The problem with gets is that it has no way to determine whether sufficient 
space has been allocated to hold the entire string. In our echo example, we have 
purposely made the buffer very small, just eight characters long. Any string 
longer than seven characters will cause an out-of-bounds write. 

By examining the assembly code generated by GCC for echo, we can infer how 
the stack is organized: 


  void echo() 
1 echo: 
2   subq $24, %rsp       Allocate 24 bytes on stack 
3   movq %rsp, %rdi      Compute buf as %rsp 
4   call gets            Call gets 
5   movq %rsp, %rdi      Compute buf as %rsp 
6   call puts            Call puts 
7   addq $24, %rsp       Deallocate stack space 
8   ret                  Return 


Figure 3.40 illustrates the stack organization during the execution of echo. The 
program allocates 24 bytes on the stack by subtracting 24 from the stack pointer 
(line 2). Character array buf is positioned at the top of the stack, as can be seen by the 
fact that %rsp is copied to %rdi to be used as the argument to the calls to both 
gets and puts. The 16 bytes between buf and the stored return pointer are not 
used. As long as the user types at most seven characters, the string returned by 
gets (including the terminating null) will fit within the space allocated for buf. 
A longer string, however, will cause gets to overwrite some of the information 
stored on the stack. As the string gets longer, the following information will get 
corrupted: 


Characters typed    Additional corrupted state 

0-7                 None 
8-23                Unused stack space 
24-31               Return address 
32+                 Saved state in caller 


No serious consequence occurs for strings of up to 23 characters, but beyond 
that, the value of the return pointer, and possibly additional saved state, will 
be corrupted. If the stored value of the return address is corrupted, then the 
ret instruction (line 8) will cause the program to jump to a totally unexpected 
location. None of these behaviors would seem possible based on the C code. The 
impact of out-of-bounds writing to memory by functions such as gets can only be 
understood by studying the program at the machine-code level. 
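
To make the table above concrete, the following sketch classifies an input length by the furthest region of the stack it reaches. The function corrupted_state and the constant names are hypothetical helpers of our own. The offsets assume the echo frame shown in the assembly code: buf in bytes 0-7, unused space in bytes 8-23, and the return address in bytes 24-31.

```c
#include <stddef.h>

/* Region boundaries for the echo frame described in the text. */
enum { BUF_BYTES = 8, RET_ADDR_OFFSET = 24, CALLER_OFFSET = 32 };

/* Hypothetical helper: given the number of characters typed, name the
   furthest region that gets overwrites. gets stores typed + 1 bytes,
   because it appends a terminating null character. */
const char *corrupted_state(size_t typed) {
    size_t written = typed + 1;
    if (written <= BUF_BYTES)
        return "None";
    if (written <= RET_ADDR_OFFSET)
        return "Unused stack space";
    if (written <= CALLER_OFFSET)
        return "Return address";
    return "Saved state in caller";
}
```

The boundary cases follow from the terminating null: seven characters fill buf exactly, and 23 characters fill the unused region up to, but not including, the return address.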

Our code for echo is simple but sloppy. A better version involves using the 
function fgets, which includes as an argument a count on the maximum number 
of bytes to read. Problem 3.71 asks you to write an echo function that can handle 
an input string of arbitrary length. In general, using gets or any function that 
can overflow storage is considered a bad programming practice. Unfortunately, 
a number of commonly used library functions, including strcpy, strcat, and 
sprintf, have the property that they can generate a byte sequence without being 
given any indication of the size of the destination buffer [97]. Such conditions can 
lead to vulnerabilities to buffer overflow. 
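
As a minimal sketch of the fgets approach mentioned above, the helper below reads one line into a buffer whose size the caller states explicitly. The name read_line_bounded and its interface are our own assumptions, not a library function. It truncates long lines rather than handling input of arbitrary length, so it is not a solution to Problem 3.71.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read at most size-1 characters of one line from in
   into buf, always null-terminating. Returns buf, or NULL if nothing was
   read. Characters beyond the limit stay unread in the input stream. */
char *read_line_bounded(FILE *in, char *buf, size_t size) {
    if (size == 0 || size > INT_MAX)
        return NULL;
    if (fgets(buf, (int) size, in) == NULL)
        return NULL;
    buf[strcspn(buf, "\n")] = '\0';   /* Drop the newline, if one was stored */
    return buf;
}
```

An echo function built on this helper would call read_line_bounded(stdin, buf, sizeof buf) in place of gets(buf), so that a long input line can no longer write past the end of buf.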


Figure 3.41 shows a (low-quality) implementation of a function that reads a line 
from standard input, copies the string to newly allocated storage, and returns a 
pointer to the result. 

Consider the following scenario. Procedure get_line is called with the return 
address equal to 0x400776 and register %rbx equal to 0x0123456789ABCDEF. You 
type in the string 


0123456789012345678901234 

(a) C code 

/* This is very low-quality code. 
   It is intended to illustrate bad programming practices. 
   See Practice Problem 3.46. */ 
char *get_line() 
{ 
    char buf[4]; 
    char *result; 
    gets(buf); 
    result = malloc(strlen(buf)); 
    strcpy(result, buf); 
    return result; 
} 

(b) Disassembly up through call to gets 

  char *get_line() 
1 0000000000400720 <get_line>: 
2   400720: 53                push %rbx 
3   400721: 48 83 ec 10       sub $0x10,%rsp 
  Diagram stack at this point 
4   400725: 48 89 e7          mov %rsp,%rdi 
5   400728: e8 73 ff ff ff    callq 4006a0 <gets> 
  Modify diagram to show stack contents at this point 

Figure 3.41 C and disassembled code for Practice Problem 3.46. 


The program terminates with a segmentation fault. You run GDB and determine 
that the error occurs during the execution of the ret instruction of get_line. 


A. Fill in the diagram that follows, indicating as much as you can about the stack 
just after executing the instruction at line 3 in the disassembly. Label the 
quantities stored on the stack (e.g., “Return address”) on the right, and their 
hexadecimal values (if known) within the box. Each box represents 8 bytes. 
Indicate the position of %rsp. Recall that the ASCII codes for characters 0-9 
are 0x30-0x39. 


00 00 00 00 00 40 00 76    Return address 


B. Modify your diagram to show the effect of the call to gets (line 5). 

C. To what address does the program attempt to return? 

D. What register(s) have corrupted value(s) when get_line returns? 

E. Besides the potential for buffer overflow, what two other things are wrong 
with the code for get_line? 




A more pernicious use of buffer overflow is to get a program to perform 
a function that it would otherwise be unwilling to do. This is one of the most 
common methods to attack the security of a system over a computer network. 
Typically, the program is fed with a string that contains the byte encoding of some 
executable code, called the exploit code, plus some extra bytes that overwrite the 
return address with a pointer to the exploit code. The effect of executing the ret 
instruction is then to jump to the exploit code. 

In one form of attack, the exploit code then uses a system call to start up a 
shell program, providing the attacker with a range of operating system functions. 
In another form, the exploit code performs some otherwise unauthorized task, 
repairs the damage to the stack, and then executes ret a second time, causing an 
(apparently) normal return to the caller. 

As an example, the famous Internet worm of November 1988 used four different 
ways to gain access to many of the computers across the Internet. One was 
a buffer overflow attack on the finger daemon fingerd, which serves requests by 
the FINGER command. By invoking FINGER with an appropriate string, the worm 
could make the daemon at a remote site have a buffer overflow and execute code 
that gave the worm access to the remote system. Once the worm gained access to a 
system, it would replicate itself and consume virtually all of the machine’s computing 
resources. As a consequence, hundreds of machines were effectively paralyzed 
until security experts could determine how to eliminate the worm. The author of 
the worm was caught and prosecuted. He was sentenced to 3 years probation, 400 
hours of community service, and a $10,500 fine. Even to this day, however, people 
continue to find security leaks in systems that leave them vulnerable to buffer 
overflow attacks. This highlights the need for careful programming. Any interface 
to the external environment should be made “bulletproof” so that no behavior by 
an external agent can cause the system to misbehave. 


3.10.4 Thwarting Buffer Overflow Attacks 



Buffer overflow attacks have become so pervasive and have caused so many 
problems with computer systems that modern compilers and operating systems 
have implemented mechanisms to make it more difficult to mount these attacks 
and to limit the ways by which an intruder can seize control of a system via a buffer 
overflow attack. In this section, we will present mechanisms that are provided by 
recent versions of GCC for Linux. 


Stack Randomization 


In order to insert exploit code into a system, the attacker needs to inject both 
the code as well as a pointer to this code as part of the attack string. Generating 
this pointer requires knowing the stack address where the string will be located. 
Historically, the stack addresses for a program were highly predictable. For all 
systems running the same combination of program and operating system version, 
the stack locations were fairly stable across many machines. So, for example, if 
an attacker could determine the stack addresses used by a common Web server, 
it could devise an attack that would work on many machines. Using infectious 
disease as an analogy, many systems were vulnerable to the exact same strain of 
a virus, a phenomenon often referred to as a security monoculture [96]. 

The idea of stack randomization is to make the position of the stack vary from 
one run of a program to another. Thus, even if many machines are running identical 
code, they would all be using different stack addresses. This is implemented by 
allocating a random amount of space between 0 and n bytes on the stack at the 
start of a program, for example, by using the allocation function alloca, which 
allocates space for a specified number of bytes on the stack. This allocated space is 
not used by the program, but it causes all subsequent stack locations to vary from 
one execution of a program to another. The allocation range n needs to be large 
enough to get sufficient variations in the stack addresses, yet small enough that it 
does not waste too much space in the program. 

The following code shows a simple way to determine a "typical" stack address: 


int main() { 
    long local; 
    printf("local at %p\n", &local); 
    return 0; 
} 


This code simply prints the address of a local variable in the main function. 
Running the code 10,000 times on a Linux machine in 32-bit mode, the addresses 
ranged from 0xff7fc59c to 0xffffd09c, a range of around 2^23. Running in 64-bit 
mode on the newer machine, the addresses ranged from 0x7fff0001b698 to 
0x7ffffffaa4a8, a range of nearly 2^32. 

Stack randomization has become standard practice in Linux systems. It is 
one of a larger class of techniques known as address-space layout randomization, 
or ASLR [99]. With ASLR, different parts of the program, including program 
code, library code, stack, global variables, and heap data, are loaded into different 
regions of memory each time a program is run. That means that a program running 
on one machine will have very different address mappings than the same program 
running on other machines. This can thwart some forms of attack. 

Overall, however, a persistent attacker can overcome randomization by brute 
force, repeatedly attempting attacks with different addresses. A common trick is 
to include a long sequence of nop (pronounced “no op,” short for “no operation”) 
instructions before the actual exploit code. Executing this instruction has no effect, 
other than incrementing the program counter to the next instruction. As long 
as the attacker can guess an address somewhere within this sequence, the program 
will run through the sequence and then hit the exploit code. The common term for 
this sequence is a “nop sled” [97], expressing the idea that the program “slides” 
through the sequence. If we set up a 256-byte nop sled, then the randomization 
over n = 2^23 can be cracked by enumerating 2^15 = 32,768 starting addresses, which 
is entirely feasible for a determined attacker. For the 64-bit case, trying to enumerate 
2^24 = 16,777,216 is a bit more daunting. We can see that stack randomization 
and other aspects of ASLR can increase the effort required to successfully attack a 
system, and therefore greatly reduce the rate at which a virus or worm can spread, 
but it cannot provide a complete safeguard. 
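
The enumeration counts quoted above follow from dividing the randomization range by the sled length. As a rough sketch, with a helper name of our own choosing, the arithmetic can be written as follows. It assumes the attacker steps through the range in sled-sized increments, so that one guess is certain to land inside the sled.

```c
#include <stdint.h>

/* Hypothetical helper: number of starting addresses to try so that some
   guess falls within a nop sled of sled_bytes, given that the stack position
   varies over range_bytes. The division rounds up. */
uint64_t sled_attempts(uint64_t range_bytes, uint64_t sled_bytes) {
    return (range_bytes + sled_bytes - 1) / sled_bytes;
}
```

With a 256-byte sled, sled_attempts(1 << 23, 256) gives 32,768 for the 32-bit range, and sled_attempts(1ULL << 32, 256) gives 16,777,216 for the 64-bit range, matching the figures in the text.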













sion 2.6.16, we obtained addresses ranging from a minimum of 0xffffb754 to a 
maximum of 0xffffd754. 

A. What is the approximate range of addresses? 

B. If we attempted a buffer overrun with a 128-byte nop sled, about how many 
attempts would it take to test all starting addresses? 










Stack Corruption Detection 


A second line of defense is to be able to detect when a stack has been corrupted. 
We saw in the example of the echo function (Figure 3.40) that the corruption 
typically occurs when the program overruns the bounds of a local buffer. In C, 
there is no reliable way to prevent writing beyond the bounds of an array. Instead, 
the program can attempt to detect when such a write has occurred before it can 
have any harmful effects. 

Recent versions of GCC incorporate a mechanism known as a stack protector 
into the generated code to detect buffer overruns. The idea is to store a special 
canary value^4 in the stack frame between any local buffer and the rest of the stack 
state, as illustrated in Figure 3.42 [26, 97]. This canary value, also referred to as a 
guard value, is generated randomly each time the program runs, and so there is no 
easy way for an attacker to determine what it is. Before restoring the register state 
and returning from the function, the program checks if the canary has been altered 
by some operation of this function or one that it has called. If so, the program 
aborts with an error. 


Figure 3.42 Stack organization for echo function with stack protector enabled. A 
special “canary” value is positioned between array buf and the saved state. The code 
checks the canary value to determine whether or not the stack state has been corrupted. 
(Diagram labels: stack frame for caller, return address at %rsp+24, canary, buf at %rsp, 
stack frame for echo.) 


4. The term “canary” refers to the historic use of these birds to detect the presence of dangerous gases 
in coal mines. 

Recent versions of GCC try to determine whether a function is vulnerable to 
a stack overflow and insert this type of overflow detection automatically. In fact, 
for our earlier demonstration of stack overflow, we had to give the command-line 
option -fno-stack-protector to prevent GCC from inserting this code. Compiling 
the function echo without this option, and hence with the stack protector enabled, 
gives the following assembly code: 


   void echo() 
 1 echo: 
 2   subq $24, %rsp          Allocate 24 bytes on stack 
 3   movq %fs:40, %rax       Retrieve canary 
 4   movq %rax, 8(%rsp)      Store on stack 
 5   xorl %eax, %eax         Zero out register 
 6   movq %rsp, %rdi         Compute buf as %rsp 
 7   call gets               Call gets 
 8   movq %rsp, %rdi         Compute buf as %rsp 
 9   call puts               Call puts 
10   movq 8(%rsp), %rax      Retrieve canary 
11   xorq %fs:40, %rax       Compare to stored value 
12   je .L9                  If =, goto ok 
13   call __stack_chk_fail   Stack corrupted! 
14 .L9:                    ok: 
15   addq $24, %rsp          Deallocate stack space 
16   ret 


We see that this version of the function retrieves a value from memory (line 3) 
and stores it on the stack at offset 8 from %rsp, just beyond the region allocated for 
buf. The instruction argument %fs:40 is an indication that the canary value is read 
from memory using segmented addressing, an addressing mechanism that dates 
back to the 80286 and is seldom found in programs running on modern systems. 
By storing the canary in a special segment, it can be marked as “read only,” so 
that an attacker cannot overwrite the stored canary value. Before restoring the 
register state and returning, the function compares the value stored at the stack 
location with the canary value (via the xorq instruction on line 11). If the two are 
identical, the xorq instruction will yield zero, and the function will complete in the 
normal fashion. A nonzero value indicates that the canary on the stack has been 
modified, and so the code will call an error routine. 
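
The check can be modeled in portable C without touching the real stack. The sketch below is only an illustration under our own assumptions: frame_t and the three helper functions are hypothetical, and the frame is an ordinary byte array laid out like the protected echo frame, with buf in bytes 0-7 and the canary in bytes 8-15. It does not reproduce what GCC generates.

```c
#include <stdint.h>
#include <string.h>

enum { BUF_SIZE = 8, FRAME_SIZE = 24 };

/* Model of the protected echo frame: buf at offset 0, canary at offset 8. */
typedef struct {
    unsigned char bytes[FRAME_SIZE];
} frame_t;

/* Clear the frame and store the canary just beyond buf. */
void frame_init(frame_t *f, uint64_t canary) {
    memset(f->bytes, 0, FRAME_SIZE);
    memcpy(f->bytes + BUF_SIZE, &canary, sizeof canary);
}

/* Copy s and its terminating null into buf with no check against BUF_SIZE,
   as gets would. The copy is clamped to the model frame so that the sketch
   itself stays within the array. */
void frame_unchecked_write(frame_t *f, const char *s) {
    size_t n = strlen(s) + 1;
    if (n > FRAME_SIZE)
        n = FRAME_SIZE;
    memcpy(f->bytes, s, n);
}

/* XOR the stored canary with the reference value, as the xorq instruction
   does. A zero result means the canary is intact. */
int frame_canary_ok(const frame_t *f, uint64_t canary) {
    uint64_t stored;
    memcpy(&stored, f->bytes + BUF_SIZE, sizeof stored);
    return (stored ^ canary) == 0;
}
```

In this model, writing a seven-character string leaves the canary intact, while an eight-character string places its terminating null on the first canary byte, so the check fails whenever that byte of the canary was nonzero.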

Stack protection does a good job of preventing a buffer overflow attack from 
corrupting state stored on the program stack. It incurs only a small performance 
penalty, especially because GCC only inserts it when there is a local buffer of 
type char in the function. Of course, there are other ways to corrupt the state 
of an executing program, but reducing the vulnerability of the stack thwarts many 


common attack strategies. 





The functions intlen, len, and iptoa provide a very convoluted way to compute 
the number of decimal digits required to represent an integer. We will use this as 
a way to study some aspects of the GCC stack-protector facility. 


int len(char *s) { 
    return strlen(s); 
} 

void iptoa(char *s, long *p) { 
    long val = *p; 
    sprintf(s, "%ld", val); 
} 

int intlen(long x) { 
    long v; 
    char buf[12]; 
    v = x; 
    iptoa(buf, &v); 
    return len(buf); 
} 


The following show portions of the code for intlen, compiled both with and 
without stack protector: 


(a) Without protector 

  int intlen(long x) 
  x in %rdi 
1 intlen: 
2   subq $40, %rsp 
3   movq %rdi, 24(%rsp) 
4   leaq 24(%rsp), %rsi 
5   movq %rsp, %rdi 
6   call iptoa 


(b) With protector 

  int intlen(long x) 
  x in %rdi 
1 intlen: 
2   subq $56, %rsp 
3   movq %fs:40, %rax 
4   movq %rax, 40(%rsp) 
5   xorl %eax, %eax 
6   movq %rdi, 8(%rsp) 
7   leaq 8(%rsp), %rsi 
8   leaq 16(%rsp), %rdi 
9   call iptoa 


A. For both versions: What are the positions in the stack frame for buf, v, and 
(when present) the canary value? 


B. How does the rearranged ordering of the local variables in the protected 
code provide greater security against a buffer overrun attack? 


Limiting Executable Code Regions 


A final step is to eliminate the ability of an attacker to insert executable code into 
a system. One method is to limit which memory regions hold executable code. 
In typical programs, only the portion of memory holding the code generated by 
the compiler need be executable. The other portions can be restricted to allow 
just reading and writing. As we will see in Chapter 9, the virtual memory space 
is logically divided into pages, typically with 2,048 or 4,096 bytes per page. The 
hardware supports different forms of memory protection, indicating the forms of 
access allowed by both user programs and the operating system kernel. Many sys- 
tems allow control over three forms of access: read (reading data from memory), 
write (storing data into memory), and execute (treating the memory contents as 
machine-level code). Historically, the x86 architecture merged the read and exe- 
cute access controls into a single 1-bit flag, so that any page marked as readable 
was also executable. The stack had to be kept both readable and writable, and 
therefore the bytes on the stack were also executable. Various schemes were im- 
plemented to be able to limit some pages to being readable but not executable, 
but these generally introduced significant inefficiencies. 

More recently, AMD introduced an NX (for “no-execute”) bit into the mem- 
ory protection for its 64-bit processors, separating the read and execute access 
modes, and Intel followed suit. With this feature, the stack can be marked as be- 
ing readable and writable, but not executable, and the checking of whether a page 
is executable is performed in hardware, with no penalty in efficiency. 
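
User programs can request the same kind of page-level permissions through the operating system. The following sketch assumes a POSIX system such as Linux and uses the standard mmap and mprotect calls. The helper names alloc_data_region and make_read_only are our own. Omitting PROT_EXEC asks for a region that can be read and written but not executed, which is the protection the text describes for the stack.

```c
#define _DEFAULT_SOURCE   /* Exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map len bytes that are readable and writable.
   PROT_EXEC is not requested, so on hardware that enforces the NX bit,
   jumping into this region would fault. Returns NULL on failure. */
void *alloc_data_region(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: drop write permission, leaving the region read-only.
   p must be page aligned, as addresses returned by mmap are.
   Returns 0 on success. */
int make_read_only(void *p, size_t len) {
    return mprotect(p, len, PROT_READ);
}
```

A program that generates code at run time, such as a just-in-time compiler, would need to request PROT_EXEC explicitly for the pages holding that code.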

Some types of programs require the ability to dynamically generate and execute 
code. For example, “just-in-time” compilation techniques dynamically gen- 
erate code for programs written in interpreted languages, such as Java, to improve 
execution performance. Whether or not the run-time system can restrict the ex- 
ecutable code to just that part generated by the compiler in creating the original 
program depends on the language and the operating system. 


The techniques we have outlined—randomization, stack protection, and lim- 
iting which portions of memory can hold executable code—are three of the most 
common mechanisms used to minimize the vulnerability of programs to buffer 
overflow attacks. They all have the properties that they require no special effort 
on the part of the programmer and incur very little or no performance penalty. 
Each separately reduces the level of vulnerability, and in combination they be- 
come even more effective. Unfortunately, there are still ways to attack computers 
[85, 97], and so worms and viruses continue to compromise the integrity of many 
machines. 


3.10.5 Supporting Variable-Size Stack Frames 


We have examined the machine-level code for a variety of functions so far, but 
they all have the property that the compiler can determine in advance the amount 
of space that must be allocated for their stack frames. Some functions, however, 
require a variable amount of local storage. This can occur, for example, when the 
function calls alloca, a standard library function that can allocate an arbitrary 
number of bytes of storage on the stack. It can also occur when the code declares 
a local array of variable size. 

Although the information presented in this section should rightfully be con- 
sidered an aspect of how procedures are implemented, we have deferred the 
presentation to this point, since it requires an understanding of arrays and align- 
ment. 

The code of Figure 3.43(a) gives an example of a function containing a 
variable-size array. The function declares local array p of n pointers, where n is 
given by the first argument. This requires allocating 8n bytes on the stack, where 
the value of n may vary from one call of the function to another. The compiler 
therefore cannot determine how much space it must allocate for the function's 
stack frame. In addition, the program generates a reference to the address of local 
variable i, and so this variable must also be stored on the stack. During execution, 
the program must be able to access both local variable i and the elements of array 
p. On returning, the function must deallocate the stack frame and set the stack 
pointer to the position of the stored return address. 

To manage a variable-size stack frame, x86-64 code uses register %rbp to serve 
as a frame pointer (sometimes referred to as a base pointer, and hence the letters 
bp in %rbp). When using a frame pointer, the stack frame is organized as shown 
for the case of function vframe in Figure 3.44. We see that the code must save 
the previous version of %rbp on the stack, since it is a callee-saved register. It then 
keeps %rbp pointing to this position throughout the execution of the function, and 
it references fixed-length local variables, such as i, at offsets relative to %rbp. 

(a) C code 

long vframe(long n, long idx, long *q) { 
    long i; 
    long *p[n]; 
    p[0] = &i; 
    for (i = 1; i < n; i++) 
        p[i] = q; 
    return *p[idx]; 
} 

(b) Portions of generated assembly code 

   long vframe(long n, long idx, long *q) 
   n in %rdi, idx in %rsi, q in %rdx 
   Only portions of code shown 
 1 vframe: 
 2   pushq %rbp                  Save old %rbp 
 3   movq %rsp, %rbp             Set frame pointer 
 4   subq $16, %rsp              Allocate space for i (%rsp = s1) 
 5   leaq 22(,%rdi,8), %rax 
 6   andq $-16, %rax 
 7   subq %rax, %rsp             Allocate space for array p (%rsp = s2) 
 8   leaq 7(%rsp), %rax 
 9   shrq $3, %rax 
10   leaq 0(,%rax,8), %r8        Set %r8 to &p[0] 
11   movq %r8, %rcx              Set %rcx to &p[0] (%rcx = p) 
     ... 
   Code for initialization loop 
   i in %rax and on stack, n in %rdi, p in %rcx, q in %rdx 
12 .L3:                        loop: 
13   movq %rdx, (%rcx,%rax,8)    Set p[i] to q 
14   addq $1, %rax               Increment i 
15   movq %rax, -8(%rbp)         Store on stack 
16 .L2: 
17   movq -8(%rbp), %rax         Retrieve i from stack 
18   cmpq %rdi, %rax             Compare i:n 
19   jl .L3                      If <, goto loop 
     ... 
   Code for function exit 
20   leave                       Restore %rbp and %rsp 
21   ret                         Return 

Figure 3.43 Function requiring the use of a frame pointer. The variable-size array implies that the size of 
the stack frame cannot be determined at compile time. 








Figure 3.44 Stack frame structure for function vframe. The function uses register 
%rbp as a frame pointer. The annotations along the right-hand side are in reference 
to Practice Problem 3.49. 
(Diagram labels: frame pointer %rbp, 8n bytes, stack pointer %rsp.) 


Figure 3.43(b) shows portions of the code GCC generates for function vframe. 
At the beginning of the function, we see code that sets up the stack frame and 
allocates space for array p. The code starts by pushing the current value of %rbp 
onto the stack and setting %rbp to point to this stack position (lines 2-3). Next, it 
allocates 16 bytes on the stack, the first 8 of which are used to store local variable 
i, and the second 8 of which are unused. Then it allocates space for array p (lines 
5-11). The details of how much space it allocates and where it positions p within 
this space are explored in Practice Problem 3.49. Suffice it to say that by the time 
the program reaches line 11, it has (1) allocated at least 8n bytes on the stack and 
(2) positioned array p within the allocated region such that at least 8n bytes are 
available for its use. 
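
The arithmetic carried out by these instructions can be mirrored directly in C, which makes it easy to experiment with different values of n and of the stack pointer. The two helper names below are our own, and the stack pointer values used with them are purely illustrative assumptions.

```c
#include <stdint.h>

/* Mirrors lines 5-6: leaq 22(,%rdi,8), %rax followed by andq $-16, %rax.
   The mask ~15 has the same bit pattern as -16 in two's complement. */
uint64_t array_alloc_bytes(uint64_t n) {
    return (8 * n + 22) & ~(uint64_t) 15;
}

/* Mirrors lines 8-10: leaq 7(%rsp), %rax; shrq $3, %rax;
   leaq 0(,%rax,8), %r8. The argument s2 is the stack pointer value after
   the subq instruction of line 7. */
uint64_t array_start(uint64_t s2) {
    return ((s2 + 7) >> 3) << 3;
}
```

For example, array_alloc_bytes(5) is 48, which covers the 40 bytes that five pointers require, and array_start(2017) is 2024, a multiple of 8 lying within the allocated region.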

The code for the initialization loop shows examples of how local variables 
i and p are referenced. Line 13 shows array element p[i] being set to q. This 
instruction uses the value in register %rcx as the address for the start of p. We can 
see instances where local variable i is updated (line 15) and read (line 17). The 
address of i is given by reference -8(%rbp), that is, at offset -8 relative to the 
frame pointer. 

At the end of the function, the frame pointer is restored to its previous value 
using the leave instruction (line 20). This instruction takes no arguments. It is 
equivalent to executing the following two instructions: 


    movq %rbp, %rsp    Set stack pointer to beginning of frame
    popq %rbp          Restore saved %rbp and set stack ptr
                       to end of caller's frame


That is, the stack pointer is first set to the position of the saved value of %rbp, and
then this value is popped from the stack into %rbp. This instruction combination
has the effect of deallocating the entire stack frame.








In earlier versions of x86 code, the frame pointer was used with every function
call. With x86-64 code, it is used only in cases where the stack frame may be of
variable size, as is the case for function vframe. Historically, most compilers used
frame pointers when generating IA32 code. Recent versions of Gcc have dropped
this convention. Observe that it is acceptable to mix code that uses frame pointers
with code that does not, as long as all functions treat %rbp as a callee-saved register.


Practice Problem 3.49


In this problem, we will explore the logic behind the code in lines 5-11 of
Figure 3.43(b), where space is allocated for variable-size array p. As the annotations
of the code indicate, let s1 denote the address of the stack pointer after
executing the subq instruction of line 4. This instruction allocates the space for local
variable i. Let s2 denote the value of the stack pointer after executing the subq
instruction of line 7. This instruction allocates the storage for local array p. Finally,
let p denote the value assigned to registers %r8 and %rcx in the instructions of lines
10-11. Both of these registers are used to reference array p.

The right-hand side of Figure 3.44 diagrams the positions of the locations
indicated by s1, s2, and p. It also shows that there may be an offset of e2 bytes
between the values of s2 and p. This space will not be used. There may also be an
offset of e1 bytes between the end of array p and the position indicated by s1.






A. Explain, in mathematical terms, the logic in the computation of s2 on lines
5-7. Hint: Think about the bit-level representation of -16 and its effect in
the andq instruction of line 6.


B. Explain, in mathematical terms, the logic in the computation of p on lines 
8-10. Hint: You may want to refer to the discussion on division by powers 
of 2 in Section 2.3.7. 


C. For the following values of n and s1, trace the execution of the code to
determine what the resulting values would be for s2, p, e1, and e2.


n    s1       s2      p       e1      e2
5    2,065    ____    ____    ____    ____
6    2,064    ____    ____    ____    ____




D. What alignment properties does this code guarantee for the values of s2
and p?





3.11 Floating-Point Code 


The floating-point architecture for a processor consists of the different aspects 
that affect how programs operating on floating-point data are mapped onto the 
machine, including


* How floating-point values are stored and accessed. This is typically via some 
form of registers. 
* The instructions that operate on floating-point data.


* The conventions used for passing floating-point values as arguments to
functions and for returning them as results.


* The conventions for how registers are preserved during function calls—for 
example, with some registers designated as caller saved, and others as callee 
saved. 


To understand the x86-64 floating-point architecture, it is helpful to have a 
brief historical perspective. Since the introduction of the Pentium/MMX in 1997, 
both Intel and AMD have incorporated successive generations of media instruc- 
tions to support graphics and image processing. These instructions originally fo- 
cused on allowing multiple operations to be performed in a parallel mode known 
as single instruction, multiple data, or SIMD (pronounced sim-dee). In this mode 
the same operation is performed on a number of different data values in parallel. 
Over the years, there has been a progression of these extensions. The names have
changed through a series of major revisions from MMX to SSE (for "streaming
SIMD extensions") and most recently AVX (for "advanced vector extensions").
Within each generation, there have also been different versions. Each of these ex- 
tensions manages data in sets of registers, referred to as “MM” registers for MMX, 
“XMM” for SSE, and “YMM” for AVX, ranging from 64 bits for MM registers, 
to 128 for XMM, to 256 for YMM. So, for example, each YMM register can hold 
eight 32-bit values, or four 64-bit values, where these values can be either integer 
or floating point. 
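
The element counts quoted above follow from dividing the register width by the element size. The sketch below is only an illustration of that arithmetic in C. The union ymm_model_t and the helpers lanes and ymm_model_demo are hypothetical names introduced here, and the union merely models the storage of a 256-bit register (assuming 4-byte float and 8-byte double); it does not access a real register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the 32 bytes held by one YMM register.
   All members overlay the same storage. */
typedef union {
    uint8_t bytes[32];  /* 256 bits = 32 bytes */
    float   f32[8];     /* eight 32-bit floating-point values */
    double  f64[4];     /* four 64-bit floating-point values */
    int32_t i32[8];     /* eight 32-bit integers */
    int64_t i64[4];     /* four 64-bit integers */
} ymm_model_t;

/* Number of elements of elem_bytes bytes that fit in reg_bits bits. */
size_t lanes(size_t reg_bits, size_t elem_bytes)
{
    return (reg_bits / 8) / elem_bytes;
}

/* Hypothetical self-check for the counts given in the text. */
int ymm_model_demo(void)
{
    assert(sizeof(ymm_model_t) == 32);
    assert(lanes(256, sizeof(float)) == 8);   /* YMM: eight 32-bit values */
    assert(lanes(256, sizeof(double)) == 4);  /* YMM: four 64-bit values */
    assert(lanes(128, sizeof(float)) == 4);   /* XMM: four 32-bit values */
    return 1;
}
```

Calling ymm_model_demo() from a test program checks the counts for both the 256-bit and 128-bit widths.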

Starting with SSE2, introduced with the Pentium 4 in 2000, the media in- 
structions have included ones to operate on scalar floating-point data, using single 
values in the low-order 32 or 64 bits of XMM or YMM registers. This scalar mode 
provides a set of registers and instructions that are more typical of the way other 
processors support floating point. All processors capable of executing x86-64 code 
support SSE2 or higher, and hence x86-64 floating point is based on SSE or AVX, 
including conventions for passing procedure arguments and return values [77]. 

Our presentation is based on AVX2, the second version of AVX, introduced 
with the Core i7 Haswell processor in 2013. Gcc will generate AVX2 code when 
given the command-line parameter -mavx2. Code based on the different versions 
of SSE, as well as the first version of AVX, is conceptually similar, although they 
differ in the instruction names and formats. We present only instructions that 
arise in compiling floating-point programs with Gcc. These are, for the most part, 
the scalar AVX instructions, although we document occasions where instructions 
intended for operating on entire data vectors arise. A more complete coverage 
of how to exploit the SIMD capabilities of SSE and AVX is presented in Web
Aside OPT:SIMD on page 546. Readers may wish to refer to the AMD and Intel
documentation for the individual instructions [4, 51]. As with integer operations,
note that the ATT format we use in our presentation differs from the Intel format
used in these documents. In particular, the instruction operands are listed in a
different order in these two versions.
Figure 3.45 Media registers. These registers are used to hold floating-point data.
Each YMM register holds 32 bytes. The low-order 16 bytes can be accessed as an XMM 
register. 


As is illustrated in Figure 3.45, the AVX floating-point architecture allows
data to be stored in 16 YMM registers, named %ymm0 through %ymm15. Each YMM register
is 256 bits (32 bytes) long. When operating on scalar data, these registers only
hold floating-point data, and only the low-order 32 bits (for float) or 64 bits (for
double) are used. The assembly code refers to the registers by their SSE XMM
register names %xmm0 through %xmm15, where each XMM register is the low-order 128 bits
(16 bytes) of the corresponding YMM register.
Instruction    Source    Destination    Description
vmovss         M32       X              Move single precision
vmovss         X         M32            Move single precision
vmovsd         M64       X              Move double precision
vmovsd         X         M64            Move double precision
vmovaps        X         X              Move aligned, packed single precision
vmovapd        X         X              Move aligned, packed double precision


Figure 3.46 Floating-point movement instructions. These operations transfer values
between memory and registers, as well as between pairs of registers. (X: XMM register
(e.g., %xmm3); M32: 32-bit memory range; M64: 64-bit memory range)


3.11.1 Floating-Point Movement and Conversion Operations 


Figure 3.46 shows a set of instructions for transferring floating-point data between 
memory and XMM registers, as well as from one XMM register to another without 
any conversions. Those that reference memory are scalar instructions, meaning 
that they operate on individual, rather than packed, data values. The data are 
held either in memory (indicated in the table as M32 and M64) or in XMM registers
(shown in the table as X). These instructions will work correctly regardless of the
alignment of data, although the code optimization guidelines recommend that
32-bit memory data satisfy a 4-byte alignment and that 64-bit data satisfy an 8-byte
alignment. Memory references are specified in the same way as for the integer mov
instructions, with all of the different possible combinations of displacement, base 
register, index register, and scaling factor. 

Gcc uses the scalar movement operations only to transfer data from memory 
to an XMM register or from an XMM register to memory. For transferring data 
between two XMM registers, it uses one of two different instructions for copying 
the entire contents of one XMM register to another—namely, vmovaps for single- 
precision and vmovapd for double-precision values. For these cases, whether the 
program copies the entire register or just the low-order value affects neither the 
program functionality nor the execution speed, and so using these instructions 
rather than ones specific to scalar data makes no real difference. The letter ‘a’ 
in these instruction names stands for "aligned." When used to read and write
memory, they will cause an exception if the address does not satisfy a 16-byte
alignment. For transferring between two registers, there is no possibility of an
incorrect alignment.

As an example of the different floating-point move operations, consider the 
C function 


    float float_mov(float v1, float *src, float *dst) {
        float v2 = *src;
        *dst = v1;
        return v2;
    }
Instruction     Source    Destination    Description
vcvttss2si      X/M32     R32            Convert with truncation single precision to integer
vcvttsd2si      X/M64     R32            Convert with truncation double precision to integer
vcvttss2siq     X/M32     R64            Convert with truncation single precision to quad word integer
vcvttsd2siq     X/M64     R64            Convert with truncation double precision to quad word integer


Figure 3.47 Two-operand floating-point conversion operations. These convert floating-point data to
integers. (X: XMM register (e.g., %xmm3); R32: 32-bit general-purpose register (e.g., %eax); R64: 64-bit
general-purpose register (e.g., %rax); M32: 32-bit memory range; M64: 64-bit memory range)


Instruction     Source 1    Source 2    Destination    Description
vcvtsi2ss       M32/R32     X           X              Convert integer to single precision
vcvtsi2sd       M32/R32     X           X              Convert integer to double precision
vcvtsi2ssq      M64/R64     X           X              Convert quad word integer to single precision
vcvtsi2sdq      M64/R64     X           X              Convert quad word integer to double precision


Figure 3.48 Three-operand floating-point conversion operations. These instructions convert from the
data type of the first source to the data type of the destination. The second source value has no effect on the
low-order bytes of the result. (X: XMM register (e.g., %xmm3); M32: 32-bit memory range; M64: 64-bit memory
range)


and its associated x86-64 assembly code 


    float float_mov(float v1, float *src, float *dst)
    v1 in %xmm0, src in %rdi, dst in %rsi
    float_mov:
      vmovaps %xmm0, %xmm1      Copy v1
      vmovss  (%rdi), %xmm0     Read v2 from src
      vmovss  %xmm1, (%rsi)     Write v1 to dst
      ret                       Return v2 in %xmm0


We can see in this example the use of the vmovaps instruction to copy data from
one register to another and the use of the vmovss instruction to copy data
from memory to an XMM register and from an XMM register to memory.

Figures 3.47 and 3.48 show sets of instructions for converting between
floating-point and integer data types, as well as between different floating-point formats.
These are all scalar instructions operating on individual data values. Those in
Figure 3.47 convert from a floating-point value read from either an XMM register
or memory and write the result to a general-purpose register (e.g., %rax, %ebx,
etc.). When converting floating-point values to integers, they perform truncation,
rounding values toward zero, as is required by C and most other programming
languages.
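
The effect of truncation is easiest to see with negative values, where rounding toward zero differs from rounding down. The following is a minimal sketch in C. The function names are hypothetical, and the comments naming the instructions a compiler might choose are assumptions based on Figure 3.47, not a guarantee about any particular compiler's output.

```c
#include <assert.h>

/* double -> int; on x86-64 with AVX this cast would plausibly compile
   to vcvttsd2si (an assumption, see Figure 3.47). */
int to_int_trunc(double x)
{
    return (int) x;
}

/* float -> long; plausibly vcvttss2siq (same caveat). */
long to_long_trunc(float x)
{
    return (long) x;
}

/* Hypothetical self-check of round-toward-zero behavior. */
int trunc_demo(void)
{
    assert(to_int_trunc(2.9) == 2);      /* fraction discarded */
    assert(to_int_trunc(-2.9) == -2);    /* toward zero, not down to -3 */
    assert(to_long_trunc(-0.5f) == 0);
    assert(to_long_trunc(1e6f) == 1000000L);
    return 1;
}
```

Note that -2.9 becomes -2 rather than -3: the conversion discards the fraction instead of rounding toward negative infinity.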

The instructions in Figure 3.48 convert from integer to floating point. They 
use an unusual three-operand format, with two sources and a destination. The 
first operand is read from memory or from a general-purpose register. For our
purposes, we can ignore the second operand, since its value only affects the upper
bytes of the result. The destination must be an XMM register. In common usage,
both the second source and the destination operands are identical, as in the
instruction


    vcvtsi2sdq %rax, %xmm1, %xmm1


This instruction reads a long integer from register %rax, converts it to data type
double, and stores the result in the lower bytes of XMM register %xmm1.

Finally, for converting between two different floating-point formats, current
versions of Gcc generate code that requires separate documentation. Suppose
the low-order 4 bytes of %xmm0 hold a single-precision value; then it would seem
straightforward to use the instruction


    vcvtss2sd %xmm0, %xmm0, %xmm0


to convert this to a double-precision value and store the result in the lower 8 bytes
of register %xmm0. Instead, we find the following code generated by Gcc:


    Conversion from single to double precision
    vunpcklps %xmm0, %xmm0, %xmm0    Replicate first vector element
    vcvtps2pd %xmm0, %xmm0           Convert two vector elements to double


The vunpcklps instruction is normally used to interleave the values in two
XMM registers and store them in a third. That is, if one source register contains
words [s3, s2, s1, s0] and the other contains words [d3, d2, d1, d0], then the value
of the destination register will be [s1, d1, s0, d0]. In the code above, we see the
same register being used for all three operands, and so if the original register
held values [x3, x2, x1, x0], then the instruction will update the register to hold
values [x1, x1, x0, x0]. The vcvtps2pd instruction expands the two low-order
single-precision values in the source XMM register to be the two double-precision values
in the destination XMM register. Applying this to the result of the preceding
vunpcklps instruction would give values [dx0, dx0], where dx0 is the result of
converting x0 to double precision. That is, the net effect of the two instructions is
to convert the original single-precision value in the low-order 4 bytes of %xmm0 to
double precision and store two copies of it in %xmm0. It is unclear why Gcc generates
this code. There is neither benefit nor need to have the value duplicated within
the XMM register.
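
The element shuffling described above can be modeled in plain C. This is only a sketch of the data movement, not the instructions themselves: the struct types and the functions unpcklps_model and cvtps2pd_model are hypothetical names introduced for illustration, and array index 0 is assumed to stand for the low-order element of the register.

```c
#include <assert.h>

/* Hypothetical models of an XMM register viewed as four floats or
   two doubles. Index 0 is the low-order element. */
typedef struct { float  w[4]; } xmm_ps_t;
typedef struct { double d[2]; } xmm_pd_t;

/* Model of the interleaving done by vunpcklps. Written from high to
   low order, the result is [s1, d1, s0, d0]. */
xmm_ps_t unpcklps_model(xmm_ps_t s, xmm_ps_t d)
{
    xmm_ps_t r = {{ d.w[0], s.w[0], d.w[1], s.w[1] }};
    return r;
}

/* Model of vcvtps2pd: widen the two low-order floats to doubles. */
xmm_pd_t cvtps2pd_model(xmm_ps_t s)
{
    xmm_pd_t r = {{ (double) s.w[0], (double) s.w[1] }};
    return r;
}

/* Hypothetical self-check: using one register for all operands
   yields [x1, x1, x0, x0], then two copies of x0 as doubles. */
int unpack_demo(void)
{
    xmm_ps_t x = {{ 1.5f, 2.5f, 3.5f, 4.5f }};  /* x0 = 1.5, x1 = 2.5 */
    xmm_ps_t u = unpcklps_model(x, x);
    assert(u.w[0] == 1.5f && u.w[1] == 1.5f);
    assert(u.w[2] == 2.5f && u.w[3] == 2.5f);
    xmm_pd_t p = cvtps2pd_model(u);
    assert(p.d[0] == 1.5 && p.d[1] == 1.5);
    return 1;
}
```

The model makes the redundancy visible: both doubles in the final result are copies of x0, and only the low-order one is used by scalar code.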

Gcc generates similar code for converting from double precision to single 
precision: 


    Conversion from double to single precision
    vmovddup   %xmm0, %xmm0    Replicate first vector element
    vcvtpd2psx %xmm0, %xmm0    Convert two vector elements to single







Suppose these instructions start with register %xmm0 holding two double-precision
values [x1, x0]. Then the vmovddup instruction will set it to [x0, x0]. The vcvtpd2psx
instruction will convert these values to single precision, pack them into the
low-order half of the register, and set the upper half to 0, yielding a result
[0.0, 0.0, x0, x0] (recall that floating-point value 0.0 is represented by a bit
pattern of all zeros). Again, there is no clear value in computing the conversion from
one precision to another this way, rather than by using the single instruction


    vcvtsd2ss %xmm0, %xmm0, %xmm0


As an example of the different floating-point conversion operations, consider
the C function


    double fcvt(int i, float *fp, double *dp, long *lp)
    {
        float f = *fp; double d = *dp; long l = *lp;
        *lp = (long) d;
        *fp = (float) i;
        *dp = (double) l;
        return (double) f;
    }


and its associated x86-64 assembly code 


    double fcvt(int i, float *fp, double *dp, long *lp)
    i in %edi, fp in %rsi, dp in %rdx, lp in %rcx
    fcvt:
      vmovss      (%rsi), %xmm0          Get f = *fp
      movq        (%rcx), %rax           Get l = *lp
      vcvttsd2siq (%rdx), %r8            Get d = *dp and convert to long
      movq        %r8, (%rcx)            Store at lp
      vcvtsi2ss   %edi, %xmm1, %xmm1     Convert i to float
      vmovss      %xmm1, (%rsi)          Store at fp
      vcvtsi2sdq  %rax, %xmm1, %xmm1     Convert l to double
      vmovsd      %xmm1, (%rdx)          Store at dp
      The following two instructions convert f to double
      vunpcklps   %xmm0, %xmm0, %xmm0
      vcvtps2pd   %xmm0, %xmm0
      ret                                Return f


All of the arguments to fcvt are passed through the general-purpose registers,
since they are either integers or pointers. The result is returned in register %xmm0.
As is documented in Figure 3.45, this is the designated return register for float
or double values. In this code, we see a number of the movement and conversion
instructions of Figures 3.46-3.48, as well as Gcc's preferred method of converting
from single to double precision.
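
To see the four conversions at the C level, it can help to run fcvt on concrete values. The sketch below repeats the function and adds a hypothetical helper, fcvt_demo, with sample inputs chosen here for illustration; none of these values come from the text.

```c
#include <assert.h>

/* Same logic as the fcvt function discussed above. */
double fcvt(int i, float *fp, double *dp, long *lp)
{
    float f = *fp; double d = *dp; long l = *lp;
    *lp = (long) d;      /* double -> long, truncating toward zero */
    *fp = (float) i;     /* int -> float */
    *dp = (double) l;    /* long -> double */
    return (double) f;   /* float -> double */
}

/* Hypothetical helper: applies fcvt to sample values and checks
   each of the four results. */
int fcvt_demo(void)
{
    float f = 1.5f; double d = -3.75; long l = 7;
    double r = fcvt(42, &f, &d, &l);
    assert(r == 1.5);     /* old *fp, widened to double */
    assert(l == -3);      /* old *dp, truncated */
    assert(f == 42.0f);   /* i as a float */
    assert(d == 7.0);     /* old *lp as a double */
    return 1;
}
```

Each output location ends up holding a value that originated in a different variable, which is why the assembly code reads all of the inputs before storing any converted result that might overwrite them.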
Practice Problem 3.50


For the following C code, the expressions val1 through val4 all map to the
program values i, f, d, and l:


    double fcvt2(int *ip, float *fp, double *dp, long l)
    {
        int i = *ip; float f = *fp; double d = *dp;
        *ip = (int) val1;
        *fp = (float) val2;
        *dp = (double) val3;
        return (double) val4;
    }

Determine the mapping, based on the following x86-64 code for the function:

    double fcvt2(int *ip, float *fp, double *dp, long l)
    ip in %rdi, fp in %rsi, dp in %rdx, l in %rcx
    Result returned in %xmm0
    fcvt2:
      movl        (%rdi), %eax
      vmovss      (%rsi), %xmm0
      vcvttsd2si  (%rdx), %r8d
      movl        %r8d, (%rdi)
      vcvtsi2ss   %eax, %xmm1, %xmm1
      vmovss      %xmm1, (%rsi)
      vcvtsi2sdq  %rcx, %xmm1, %xmm1
      vmovsd      %xmm1, (%rdx)
      vunpcklps   %xmm0, %xmm0, %xmm0
      vcvtps2pd   %xmm0, %xmm0
      ret





Practice Problem 3.51 (solution page 348)


The following C function converts an argument of type src_t to a return value of
type dest_t, where these two types are defined using typedef:


    dest_t cvt(src_t x)
    {
        dest_t y = (dest_t) x;
        return y;
    }


For execution on x86-64, assume that argument x is either in %xmm0 or in
the appropriately named portion of register %rdi (i.e., %rdi or %edi). One or
two instructions are to be used to perform the type conversion and to copy the
value to the appropriately named portion of register %rax (integer result) or
%xmm0 (floating-point result). Show the instruction(s), including the source and
destination registers.

src_t     dest_t    Instruction(s)
long      double    vcvtsi2sdq %rdi, %xmm0
double    int       ____________________
double    float     ____________________
long      float     ____________________
float     long      ____________________


3.11.2 Floating-Point Code in Procedures 


With x86-64, the XMM registers are used for passing floating-point arguments to 
functions and for returning floating-point values from them. As is illustrated in 
Figure 3.45, the following conventions are observed: 


* Up to eight floating-point arguments can be passed in XMM registers %xmm0
through %xmm7. These registers are used in the order the arguments are listed.
Additional floating-point arguments can be passed on the stack.


* A function that returns a floating-point value does so in register %xmm0.


* All XMM registers are caller saved. The callee may overwrite any of these 
registers without first saving it. 


When a function contains a combination of pointer, integer, and floating- 
point arguments, the pointers and integers are passed in general-purpose registers, 
while the floating-point values are passed in XMM registers. This means that the 
mapping of arguments to registers depends on both their types and their ordering. 
Here are several examples: 


    double f1(int x, double y, long z);

This function would have x in %edi, y in %xmm0, and z in %rsi.

    double f2(double y, int x, long z);

This function would have the same register assignment as function f1.

    double f3(float x, double *y, long *z);

This function would have x in %xmm0, y in %rdi, and z in %rsi.


Practice Problem 3.52


For each of the following function declarations, determine the register assignments
for the arguments:


A. double g1(double a, long b, float c, int d);

B. double g2(int a, double *b, float *c, long d);

C. double g3(double *a, double b, int c, float d);

D. double g4(float a, int *b, float c, double d);


3.11.3 Floating-Point Arithmetic Operations 


Figure 3.49 documents a set of scalar AVX2 floating-point instructions that
perform arithmetic operations. Each has either one (S1) or two (S1, S2) source
operands and a destination operand D. The first source operand S1 can be either an
XMM register or a memory location. The second source operand and the
destination operands must be XMM registers. Each operation has an instruction for
single precision and an instruction for double precision. The result is stored in the
destination register.

As an example, consider the following floating-point function:


    double funct(double a, float x, double b, int i)
    {
        return a*x - b/i;
    }

The x86-64 code is as follows:


    double funct(double a, float x, double b, int i)
    a in %xmm0, x in %xmm1, b in %xmm2, i in %edi
    1  funct:
         The following two instructions convert x to double
    2    vunpcklps %xmm1, %xmm1, %xmm1
    3    vcvtps2pd %xmm1, %xmm1
    4    vmulsd    %xmm0, %xmm1, %xmm0    Multiply a by x
    5    vcvtsi2sd %edi, %xmm1, %xmm1     Convert i to double
    6    vdivsd    %xmm1, %xmm2, %xmm2    Compute b/i
    7    vsubsd    %xmm2, %xmm0, %xmm0    Subtract from a*x
    8    ret                              Return


Single     Double     Effect              Description
vaddss     vaddsd     D ← S2 + S1         Floating-point add
vsubss     vsubsd     D ← S2 - S1         Floating-point subtract
vmulss     vmulsd     D ← S2 × S1         Floating-point multiply
vdivss     vdivsd     D ← S2 / S1         Floating-point divide
vmaxss     vmaxsd     D ← max(S2, S1)     Floating-point maximum
vminss     vminsd     D ← min(S2, S1)     Floating-point minimum
vsqrtss    vsqrtsd    D ← sqrt(S1)        Floating-point square root


Figure 3.49 Scalar floating-point arithmetic operations. These have either one or
two source operands and a destination operand.


The three floating-point arguments a, x, and b are passed in XMM registers
%xmm0 through %xmm2, while integer argument i is passed in register %edi. The standard
two-instruction sequence is used to convert argument x to double (lines 2-3). 
Another conversion instruction is required to convert argument i to double (line 
5). The function value is returned in register %xmm0. 
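
The conversions that appear as separate instructions in the assembly code are implicit in the C source. As a rough illustration, the sketch below writes them out explicitly. The names funct_explicit and funct_demo are hypothetical, and the comments linking each cast to an instruction describe the listing above, not a promise about what every compiler emits.

```c
#include <assert.h>

/* As in the text: x and i are converted to double implicitly. */
double funct(double a, float x, double b, int i)
{
    return a*x - b/i;
}

/* Same computation with the implicit conversions written out. */
double funct_explicit(double a, float x, double b, int i)
{
    double xd = (double) x;   /* the vunpcklps/vcvtps2pd pair */
    double id = (double) i;   /* vcvtsi2sd */
    return a*xd - b/id;       /* vmulsd, vdivsd, vsubsd */
}

/* Hypothetical self-check with values that are exact in binary. */
int funct_demo(void)
{
    assert(funct(2.0, 1.5f, 9.0, 4) == 0.75);   /* 3.0 - 2.25 */
    assert(funct(2.0, 1.5f, 9.0, 4) ==
           funct_explicit(2.0, 1.5f, 9.0, 4));
    return 1;
}
```

Because i is converted to double before the division, b/i is a floating-point division: 9.0/4 gives 2.25, not the integer quotient 2.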


Practice Problem 3.53


For the following C function, the types of the four arguments are defined by
typedef:


    double funct1(arg1_t p, arg2_t q, arg3_t r, arg4_t s)
    {
        return p/(q+r) - s;
    }

When compiled, Gcc generates the following code:


    double funct1(arg1_t p, arg2_t q, arg3_t r, arg4_t s)
    funct1:
      vcvtsi2ssq %rsi, %xmm2, %xmm2
      vaddss     %xmm0, %xmm2, %xmm0
      vcvtsi2ss  %edi, %xmm2, %xmm2
      vdivss     %xmm0, %xmm2, %xmm0
      vunpcklps  %xmm0, %xmm0, %xmm0
      vcvtps2pd  %xmm0, %xmm0
      vsubsd     %xmm1, %xmm0, %xmm0
      ret


Determine the possible combinations of types of the four arguments (there
may be more than one).


Practice Problem 3.54


Function funct2 has the following prototype:


    double funct2(double w, int x, float y, long z);

Gcc generates the following code for the function:


    double funct2(double w, int x, float y, long z)
    w in %xmm0, x in %edi, y in %xmm1, z in %rsi
    funct2:
      vcvtsi2ss  %edi, %xmm2, %xmm2
      vmulss     %xmm1, %xmm2, %xmm1
      vunpcklps  %xmm1, %xmm1, %xmm1
      vcvtps2pd  %xmm1, %xmm2
      vcvtsi2sdq %rsi, %xmm1, %xmm1
      vdivsd     %xmm1, %xmm0, %xmm0
      vsubsd     %xmm0, %xmm2, %xmm0
      ret


Write a C version of funct2.





3.11.4 Defining and Using Floating-Point Constants 


Unlike integer arithmetic operations, AVX floating-point operations cannot have
immediate values as operands. Instead, the compiler must allocate and initialize
storage for any constant values. The code then reads the values from memory. This
is illustrated by the following Celsius to Fahrenheit conversion function:


    double cel2fahr(double temp)
    {
        return 1.8 * temp + 32.0;
    }

The relevant parts of the x86-64 assembly code are as follows:

    double cel2fahr(double temp)
    temp in %xmm0
    cel2fahr:
      vmulsd  .LC2(%rip), %xmm0, %xmm0    Multiply by 1.8
      vaddsd  .LC3(%rip), %xmm0, %xmm0    Add 32.0
      ret
    .LC2:
      .long   3435973837                  Low-order 4 bytes of 1.8
      .long   1073532108                  High-order 4 bytes of 1.8
    .LC3:
      .long   0                           Low-order 4 bytes of 32.0
      .long   1077936128                  High-order 4 bytes of 32.0





We see that the function reads the value 1.8 from the memory location labeled
.LC2 and the value 32.0 from the memory location labeled .LC3. Looking at the
values associated with these labels, we see that each is specified by a pair of .long
declarations with the values given in decimal. How should these be interpreted
as floating-point values? Looking at the declaration labeled .LC2, we see that the
two values are 3435973837 (0xcccccccd) and 1073532108 (0x3ffccccc). Since
the machine uses little-endian byte ordering, the first value gives the low-order 4
bytes, while the second gives the high-order 4 bytes. From the high-order bytes,
we can extract an exponent field of 0x3ff (1023), from which we subtract a bias of
1023 to get an exponent of 0. Concatenating the fraction bits of the two values, we
get a fraction field of 0xccccccccccccd, which can be shown to be the fractional
binary representation of 0.8, to which we add the implied leading one to get 1.8.
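
The decoding steps just described can be checked mechanically. The following is a minimal sketch in C, assuming that double uses the 64-bit IEEE format; the helper names (words_to_double, exponent_field, fraction_field, decode_demo) are hypothetical and introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Combine the low-order and high-order 32-bit words of a .long pair
   into the double they encode. */
double words_to_double(uint32_t low, uint32_t high)
{
    uint64_t bits = ((uint64_t) high << 32) | low;
    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the 64 bits as a double */
    return d;
}

/* 11-bit exponent field, taken from the high-order word. */
unsigned exponent_field(uint32_t high)
{
    return (high >> 20) & 0x7ff;
}

/* 52-bit fraction field: low 20 bits of the high-order word followed
   by all 32 bits of the low-order word. */
uint64_t fraction_field(uint32_t low, uint32_t high)
{
    return ((uint64_t) (high & 0xfffff) << 32) | low;
}

/* Hypothetical self-check using the two words at label .LC2. */
int decode_demo(void)
{
    uint32_t low = 3435973837u, high = 1073532108u;
    assert(low == 0xcccccccdu && high == 0x3ffcccccu);
    assert(exponent_field(high) == 0x3ff);                  /* 1023 */
    assert(fraction_field(low, high) == 0xccccccccccccdULL);
    assert(words_to_double(low, high) == 1.8);
    return 1;
}
```

The memcpy call is used to reinterpret the bits because casting a pointer between integer and floating-point types would violate C's aliasing rules.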
Single     Double     Effect          Description
vxorps     vxorpd     D ← S2 ^ S1     Bitwise EXCLUSIVE-OR
vandps     vandpd     D ← S2 & S1     Bitwise AND


Figure 3.50 Bitwise operations on packed data. These instructions perform Boolean
operations on all 128 bits in an XMM register.





Practice Problem 3.55


Show how the numbers declared at label .LC3 encode the number 32.0.


3.11.5 Using Bitwise Operations in Floating-Point Code 


At times, we find Gcc generating code that performs bitwise operations on XMM
registers to implement useful floating-point results. Figure 3.50 shows some
relevant instructions, similar to their counterparts for operating on general-purpose
registers. These operations all act on packed data, meaning that they update the
entire destination XMM register, applying the bitwise operation to all the data in
the two source registers. Once again, our only interest for scalar data is the effect
these instructions have on the low-order 4 or 8 bytes of the destination. These
operations are often simple and convenient ways to manipulate floating-point values,
as is explored in the following problem.





Practice Problem 3.56


Consider the following C function, where EXPR is a macro defined with #define:


    double simplefun(double x) {
        return EXPR(x);
    }


Below, we show the AVX2 code generated for different definitions of EXPR,
where value x is held in %xmm0. All of them correspond to some useful operation on
floating-point values. Identify what the operations are. Your answers will require
you to understand the bit patterns of the constant words being retrieved from
memory.


A.    vmovsd  .LC1(%rip), %xmm1
      vandpd  %xmm1, %xmm0, %xmm0
    .LC1:
      .long   4294967295
      .long   2147483647
      .long   0
      .long   0


B.    vxorpd  %xmm0, %xmm0, %xmm0


C.    vmovsd  .LC2(%rip), %xmm1
      vxorpd  %xmm1, %xmm0, %xmm0
    .LC2:
      .long   0
      .long   -2147483648
      .long   0
      .long   0


3.11.6 Floating-Point Comparison Operations 


AVX2 provides two instructions for comparing floating-point values:


Instruction          Based on    Description
vucomiss S1, S2      S2 - S1     Compare single precision
vucomisd S1, S2      S2 - S1     Compare double precision


These instructions are similar to the cmp instructions (see Section 3.6), in that
they compare operands S1 and S2 (but in the opposite order one might expect) and
set the condition codes to indicate their relative values. As with cmpq, they follow
the ATT-format convention of listing the operands in reverse order. Argument
S2 must be in an XMM register, while S1 can be either in an XMM register or in
memory.

The floating-point comparison instructions set three condition codes: the zero
flag ZF, the carry flag CF, and the parity flag PF. We did not document the parity
flag in Section 3.6.1, because it is not commonly found in Gcc-generated x86 code.
For integer operations, this flag is set when the most recent arithmetic or logical
operation yielded a value where the least significant byte has even parity (i.e.,
an even number of ones in the byte). For floating-point comparisons, however, 
the flag is set when either operand is NaN. By convention, any comparison in C 
is considered to fail when one of the arguments is NaN, and this flag is used to 
detect such a condition. For example, even the comparison x == x yields 0 when x 
is NaN. 
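
The behavior of comparisons involving NaN can be observed directly in C. The sketch below is illustrative only: the helper names are hypothetical, the NAN macro from math.h is assumed to be available, and the code assumes the compiler is not run in a mode (such as a fast-math option) that relaxes the treatment of NaN.

```c
#include <assert.h>
#include <math.h>

/* Returns 1 when x compares equal to itself; 0 only when x is NaN. */
int equals_self(double x)
{
    return x == x;
}

/* Counts how many of the six relational tests succeed for (x, y). */
int count_true(double x, double y)
{
    return (x < y) + (x <= y) + (x > y) + (x >= y) + (x == y) + (x != y);
}

/* Hypothetical self-check of ordered and unordered comparisons. */
int nan_demo(void)
{
    assert(equals_self(1.0) == 1);
    assert(equals_self(NAN) == 0);
    assert(count_true(1.0, 2.0) == 3);   /* <, <=, and != hold */
    assert(count_true(NAN, 0.0) == 1);   /* only != holds */
    return 1;
}
```

With a NaN operand, the only test that succeeds is !=, since it is the negation of a failed == comparison.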

"The condition codes are set as follows: 


Ordering S2:S1    CF    ZF    PF
Unordered         1     1     1
S2 < S1           1     0     0
S2 = S1           0     1     0
S2 > S1           0     0     0


The unordered case occurs when either operand is NaN. This can be detected
with the parity flag. Commonly, the jp (for "jump on parity") instruction is used to
conditionally jump when a floating-point comparison yields an unordered result.
Except for this case, the values of the carry and zero flags are the same as those
for an unsigned comparison: ZF is set when the two operands are equal, and CF is
set when S2 < S1. Instructions such as ja and jb are used to conditionally jump on
various combinations of these flags.


(a) C code

    typedef enum {NEG, ZERO, POS, OTHER} range_t;

    range_t find_range(float x)
    {
        int result;
        if (x < 0)
            result = NEG;
        else if (x == 0)
            result = ZERO;
        else if (x > 0)
            result = POS;
        else
            result = OTHER;
        return result;
    }


(b) Generated assembly code

    range_t find_range(float x)
    x in %xmm0
    1   find_range:
    2     vxorps   %xmm1, %xmm1, %xmm1    Set %xmm1 = 0
    3     vucomiss %xmm0, %xmm1           Compare 0:x
    4     ja       .L5                    If >, goto neg
    5     vucomiss %xmm1, %xmm0           Compare x:0
    6     jp       .L8                    If NaN, goto posornan
    7     movl     $1, %eax               result = ZERO
    8     je       .L3                    If =, goto done
    9   .L8:                            posornan:
    10    vucomiss .LC0(%rip), %xmm0      Compare x:0
    11    setbe    %al                    Set result = NaN ? 1 : 0
    12    movzbl   %al, %eax              Zero-extend
    13    addl     $2, %eax               result += 2 (POS for > 0, OTHER for NaN)
    14    ret                             Return
    15  .L5:                            neg:
    16    movl     $0, %eax               result = NEG
    17  .L3:                            done:
    18    rep; ret                        Return


Figure 3.51 Illustration of conditional branching in floating-point code.

As an example of floating-point comparisons, the C function of Figure 3.51(a)
classifies argument x according to its relation to 0.0, returning an enumerated type
as the result. Enumerated types in C are encoded as integers, and so the possible
function values are: 0 (NEG), 1 (ZERO), 2 (POS), and 3 (OTHER). This final outcome
occurs when the value of x is NaN.

Gcc generates the code shown in Figure 3.51(b) for find_range. The code
is not very efficient—it compares x to 0.0 three times, even though the required 
information could be obtained with a single comparison. It also generates floating- 
point constant 0.0 twice—once using vxorps, and once by reading the value from 
memory. Let us trace the flow of the function for the four possible comparison 
results: 


x<0.0 The jabranch on line 4 will be taken, jumping to the end with a return 
value of 0. 


x=0.0 The ja (line 4) and jp (line 6) branches will not be taken, but the je 
branch (line 8) will, returning with %eax equal to 1. 


x>0.0 None of the three branches will be taken. The setbe (line 11) will yield 
d; and this will be incremented by the add1 instruction (line 13) to give a 
return value of 2. 


x= NaN The jp branch (line 6) will be taken. The third vucomiss instruction 
(line 10) will set both the carry and the zero flag, and so the setbe 
instruction (line 11) and the following instruction will set %eax to 1. This
gets incremented by the addl instruction (line 13) to give a return value
of 3. 
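
The four cases traced above can be checked against a C version of the classifier. The sketch below is a reconstruction from the description in the text (Figure 3.51(a) itself is not reproduced in this excerpt), so the exact source form is an assumption; only the enumeration order NEG, ZERO, POS, OTHER is taken from the values listed earlier.

```c
#include <math.h>

/* Enumeration order matches the values given in the text: 0, 1, 2, 3. */
typedef enum { NEG, ZERO, POS, OTHER } range_t;

/* Classify x relative to 0.0. Every ordered comparison involving NaN
   is false, so a NaN argument falls through to OTHER. */
range_t find_range(float x)
{
    if (x < 0)
        return NEG;
    else if (x == 0)
        return ZERO;
    else if (x > 0)
        return POS;
    else
        return OTHER;
}
```

Calling find_range(NAN), with NAN taken from <math.h>, exercises the jp path described in the last case.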


In Homework Problems 3.73 and 3.74, you are challenged to hand-generate 
more efficient implementations of find_range.





Function funct3 has the following prototype:

double funct3(int *ap, double b, long c, float *dp);

For this function, Gcc generates the following code:

double funct3(int *ap, double b, long c, float *dp)
ap in %rdi, b in %xmm0, c in %rsi, dp in %rdx
funct3:
  vmovss      (%rdx), %xmm1
  vcvtsi2sd   (%rdi), %xmm2, %xmm2
  vucomisd    %xmm2, %xmm0
  jbe         .L8
  vcvtsi2ssq  %rsi, %xmm0, %xmm0
  vmulss      %xmm1, %xmm0, %xmm1
  vunpcklps   %xmm1, %xmm1, %xmm1
  vcvtps2pd   %xmm1, %xmm0
  ret
.L8:
  vaddss      %xmm1, %xmm1, %xmm1
  vcvtsi2ssq  %rsi, %xmm0, %xmm0
  vaddss      %xmm1, %xmm0, %xmm0
  vunpcklps   %xmm0, %xmm0, %xmm0
  vcvtps2pd   %xmm0, %xmm0
  ret


Write a C version of funct3. 




3.11.7 Observations about Floating-Point Code 


We see that the general style of machine code generated for operating on floating- 
point data with AVX2 is similar to what we have seen for operating on integer data. 
Both use a collection of registers to hold and operate on values, and they use these 
registers for passing function arguments. 

Of course, there are many complexities in dealing with the different data types
and the rules for evaluating expressions containing a mixture of data types, and 
AVX2 code involves many more different instructions and formats than is usually 
seen with functions that perform only integer arithmetic. 

AVX2 also has the potential to make computations run faster by performing
parallel operations on packed data. Compiler developers are working on automating
the conversion of scalar code to parallel code, but currently the most reliable
way to achieve higher performance through parallelism is to use the extensions to
the C language supported by gcc for manipulating vectors of data. See Web Aside
OPT:SIMD on page 546 to see how this can be done.
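
As a small illustration of those language extensions, the sketch below uses the gcc vector_size attribute to add two float arrays eight elements at a time. The names vec8_t and add_arrays are hypothetical and chosen only for this example, and whether the compiler actually emits packed AVX instructions depends on the target flags (for example -mavx2), which is an assumption here.

```c
#include <string.h>

/* GCC vector extension: eight packed floats (32 bytes). */
typedef float vec8_t __attribute__((vector_size(32)));

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats. */
void add_arrays(float *dst, const float *a, const float *b, long n)
{
    long i = 0;

    /* Process eight elements per iteration with one packed addition. */
    for (; i + 8 <= n; i += 8) {
        vec8_t va, vb, vc;
        memcpy(&va, a + i, sizeof(va));   /* load without alignment assumptions */
        memcpy(&vb, b + i, sizeof(vb));
        vc = va + vb;                     /* element-wise add on all 8 lanes */
        memcpy(dst + i, &vc, sizeof(vc));
    }

    /* Scalar loop for the leftover elements. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The scalar tail loop keeps the function correct for lengths that are not a multiple of eight.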


3.12 Summary 


In this chapter, we have peered beneath the layer of abstraction provided by the 
C language to get a view of machine-level programming. By having the compiler 
generate an assembly-code representation of the machine-level program, we gain 
insights into both the compiler and its optimization capabilities, along with the
machine, its data types, and its instruction set. In Chapter 5, we will see that knowing
the characteristics of a compiler can help when trying to write programs that have 
efficient mappings onto the machine. We have also gotten a more complete picture 
of how the program stores data in different memory regions. In Chapter 12, we 
will see many examples where application programmers need to know whether 
a program variable is on the run-time stack, in some dynamically allocated data 
structure, or part of the global program data. Understanding how programs map 
onto machines makes it easier to understand the differences between these kinds 
of storage. 
Machine-level programs, and their representation by assembly code, differ
in many ways from C programs. There is minimal distinction between different 
data types. The program is expressed as a sequence of instructions, each of which 
performs a single operation. Parts of the program state, such as registers and the 
run-time stack, are directly visible to the programmer. Only low-level operations 
are provided to support data manipulation and program control. The compiler 
must use multiple instructions to generate and operate on different data structures 
and to implement control constructs such as conditionals, loops, and procedures. 
We have covered many different aspects of C and how it gets compiled. We 
have seen that the lack of bounds checking in C makes many programs prone to 
buffer overflows. This has made many systems vulnerable to attacks by malicious 
intruders, although recent safeguards provided by the run-time system and the 
compiler help make programs more secure. 
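
Because C performs no bounds checking on its own, any check has to be written explicitly. The following is a minimal sketch of such a check; checked_copy is a hypothetical helper introduced for illustration, not a library function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst only if it fits, including the
   terminating null byte. Returns 0 on success, -1 if src is too long. */
int checked_copy(char *dst, size_t dst_size, const char *src)
{
    size_t len = strlen(src);

    if (len >= dst_size)
        return -1;              /* would overflow: refuse instead of writing */
    memcpy(dst, src, len + 1);  /* len characters plus the null byte */
    return 0;
}
```

An unchecked strcpy into the same buffer would silently write past its end for the rejected input.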

We have only examined the mapping of C onto x86-64, but much of what we 
have covered is handled in a similar way for other combinations of language and 
machine. For example, compiling C++ is very similar to compiling C. In fact, early 
implementations of C++ first performed a source-to-source conversion from C++ 
to C and generated object code by running a C compiler on the result. C++ objects
are represented by structures, similar to a C struct. Methods are represented by 
pointers to the code implementing the methods. By contrast, Java is implemented 
in an entirely different fashion. The object code of Java is a special binary
representation known as Java byte code. This code can be viewed as a machine-level
program for a virtual machine. As its name suggests, this machine is not
implemented directly in hardware. Instead, software interpreters process the byte code,
simulating the behavior of the virtual machine. Alternatively, an approach known
as just-in-time compilation dynamically translates byte code sequences into
machine instructions. This approach provides faster execution when code is executed
multiple times, such as in loops. The advantage of using byte code as the low-level 
representation of a program is that the same code can be "executed" on many 
different machines, whereas the machine code we have considered runs only on 
x86-64 machines. 
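
To make the idea of a software interpreter concrete, here is a minimal sketch of an interpreter loop for a toy stack machine. The opcodes and the function run are hypothetical and bear no relation to real Java byte code; the point is only that a loop in ordinary machine code can simulate the instructions of a machine that does not exist in hardware.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not Java byte code). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Interpret the program in code[] and return the value on top of the
   stack when OP_HALT is reached. A sketch only: it assumes a well-formed
   program and does no stack-depth checking. */
long run(const long *code)
{
    long stack[64];
    size_t sp = 0;              /* number of values on the stack */
    size_t pc = 0;              /* index of the next instruction */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:           /* operand follows the opcode */
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:            /* pop two values, push their sum */
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:            /* pop two values, push their product */
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
        default:
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }
}
```

The same array of opcodes produces the same result on any machine for which this interpreter can be compiled, which is the portability advantage described above.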


Bibliographic Notes 


Both Intel and AMD provide extensive documentation on their processors. This 
includes general descriptions of an assembly-language programmer's view of the
hardware [2, 50], as well as detailed references about the individual
instructions [3, 51]. Reading the instruction descriptions is complicated by the facts that
(1) all documentation is based on the Intel assembly-code format, (2) there are 
many variations for each instruction due to the different addressing and execution 
modes, and (3) there are no illustrative examples. Still, these remain the authori- 
tative references about the behavior of each instruction. 

The organization x86-64.org has been responsible for defining the application 
binary interface (ABI) for x86-64 code running on Linux systems [77]. This
interface describes details for procedure linkages, binary code files, and a number of
other features that are required for machine-code programs to execute properly. 
As we have discussed, the ATT format used by gcc is very different from the
Intel format used in Intel documentation and by other compilers (including the
Microsoft compilers).

Muchnick’s book on compiler design [80] is considered the most comprehen- 
sive reference on code-optimization techniques. It covers many of the techniques 
we discuss here, such as register usage conventions. 

Much has been written about the use of buffer overflow to attack systems over 
the Internet. Detailed analyses of the 1988 Internet worm have been published 
by Spafford [105] as well as by members of the team at MIT who helped stop its 
spread [35]. Since then a number of papers and projects have generated ways both
to create and to prevent buffer overflow attacks. Seacord's book [97] provides a
wealth of information about buffer overflow and other attacks on code generated 
by C compilers. 


Homework Problems 


3.58 ◆
For a function with prototype 


long decode2(long x, long y, long z); 


Gcc generates the following assembly code: 


decode2:
  subq   %rdx, %rsi
  imulq  %rsi, %rdi
  movq   %rsi, %rax
  salq   $63, %rax
  sarq   $63, %rax
  xorq   %rdi, %rax
  ret

Parameters x, y, and z are passed in registers %rdi, %rsi, and %rdx. The code
stores the return value in register %rax.

Write C code for decode2 that will have an effect equivalent to the assembly 
code shown. 


3.59 ◆◆
The following code computes the 128-bit product of two 64-bit signed values x and
y and stores the result in memory:

typedef __int128 int128_t;

void store_prod(int128_t *dest, int64_t x, int64_t y) {
    *dest = x * (int128_t) y;
}


Gcc generates the following assembly code implementing the computation: 
store_prod:
  movq   %rdx, %rax
  cqto
  movq   %rsi, %rcx
  sarq   $63, %rcx
  imulq  %rax, %rcx
  imulq  %rsi, %rdx
  addq   %rdx, %rcx
  mulq   %rsi
  addq   %rcx, %rdx
  movq   %rax, (%rdi)
  movq   %rdx, 8(%rdi)
  ret


This code uses three multiplications for the multiprecision arithmetic required
to implement 128-bit arithmetic on a 64-bit machine. Describe the algorithm used
to compute the product, and annotate the assembly code to show how it realizes
your algorithm. Hint: When extending arguments of x and y to 128 bits, they can
be rewritten as $x = 2^{64} \cdot x_h + x_l$ and $y = 2^{64} \cdot y_h + y_l$, where $x_h$, $x_l$, $y_h$, and $y_l$ are
64-bit values. Similarly, the 128-bit product can be written as $p = 2^{64} \cdot p_h + p_l$, where
$p_h$ and $p_l$ are 64-bit values. Show how the code computes the values of $p_h$ and $p_l$
in terms of $x_h$, $x_l$, $y_h$, and $y_l$.
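
The notation in the hint can be explored directly in C, since gcc supports the __int128 type used in the problem. The sketch below only splits a 128-bit value into its high and low 64-bit halves; it does not implement the multiplication algorithm. The helper names high64, low64, and sign_half are hypothetical.

```c
#include <stdint.h>

typedef __int128 int128_t;

/* High 64 bits of a 128-bit value (p_h in the hint's notation). */
uint64_t high64(int128_t p)
{
    return (uint64_t) ((unsigned __int128) p >> 64);
}

/* Low 64 bits of a 128-bit value (p_l in the hint's notation). */
uint64_t low64(int128_t p)
{
    return (uint64_t) p;
}

/* High half of a 64-bit value after sign extension to 128 bits (x_h):
   all zeros when x >= 0, all ones when x < 0. */
uint64_t sign_half(int64_t x)
{
    return high64((int128_t) x);
}
```

Recombining the halves as high64(p) * 2^64 + low64(p) in unsigned 128-bit arithmetic reproduces the bit pattern of p.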


3.60 ◆◆
Consider the following assembly code:

long loop(long x, int n)
x in %rdi, n in %esi
loop:
  movl   %esi, %ecx
  movl   $1, %edx
  movl   $0, %eax
  jmp    .L2
.L3:
  movq   %rdi, %r8
  andq   %rdx, %r8
  orq    %r8, %rax
  salq   %cl, %rdx
.L2:
  testq  %rdx, %rdx
  jne    .L3
  rep; ret

The preceding code was generated by compiling C code that had the following
overall form:

long loop(long x, long n)
{
    long result = ______;
    long mask;
    for (mask = ______; mask ______; mask = ______) {
        result |= ______;
    }
    return result;
}


Your task is to fill in the missing parts of the C code to get a program equivalent 
to the generated assembly code. Recall that the result of the function is returned 
in register %rax. You will find it helpful to examine the assembly code before,
during, and after the loop to form a consistent mapping between the registers and 
the program variables. 


A. Which registers hold program values x, n, result, and mask?
B. What are the initial values of result and mask?
C. What is the test condition for mask?
D. How does mask get updated?
E. How does result get updated?
F. Fill in all the missing parts of the C code.


3.61 ◆◆
In Section 3.6.6, we examined the following code as a candidate for the use of 
conditional data transfer: 


long cread(long *xp) {
    return (xp ? *xp : 0);
}


We showed a trial implementation using a conditional move instruction but argued 
that it was not valid, since it could attempt to read from a null address.

Write a C function cread_alt that has the same behavior as cread, except 
that it can be compiled to use conditional data transfer. When compiled, the 
generated code should use a conditional move instruction rather than one of the 
jump instructions. 


3.62 ◆◆

The code that follows shows an example of branching on an enumerated type 
value in a switch statement. Recall that enumerated types in C are simply a way 
to introduce a set of names having associated integer values. By default, the values 
assigned to the names count from zero upward. In our code, the actions associated 
with the different case labels have been omitted. 
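
The default numbering of enumeration constants can be confirmed with a few lines of C. This sketch renames the type to mode_kind_t (an assumption made here only to avoid clashing with the POSIX type mode_t in a standalone program), and mode_name is a hypothetical helper.

```c
/* Same constants as in the problem; values count from zero upward. */
typedef enum {MODE_A, MODE_B, MODE_C, MODE_D, MODE_E} mode_kind_t;

/* Because the constants are small consecutive integers, an enumerated
   value can index an array directly, much as it indexes a jump table. */
const char *mode_name(mode_kind_t m)
{
    static const char *names[] = {"A", "B", "C", "D", "E"};

    return ((unsigned) m <= (unsigned) MODE_E) ? names[m] : "?";
}
```

The range check mirrors the comparison a compiler emits before indexing a jump table.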
/* Enumerated type creates set of constants numbered 0 and upward */
typedef enum {MODE_A, MODE_B, MODE_C, MODE_D, MODE_E} mode_t;

long switch3(long *p1, long *p2, mode_t action)
{
    long result = 0;
    switch(action) {
    case MODE_A:

    case MODE_B:

    case MODE_C:

    case MODE_D:

    case MODE_E:

    default:

    }
    return result;
}














The part of the generated assembly code implementing the different actions is
shown in Figure 3.52. The annotations indicate the argument locations, the register
values, and the case labels for the different jump destinations.

Fill in the missing parts of the C code. It contained one case that fell through
to another; try to reconstruct this.














3.63 ◆◆
This problem will give you a chance to reverse engineer a switch statement from
disassembled machine code. In the following procedure, the body of the switch
statement has been omitted:

long switch_prob(long x, long n) {
    long result = x;
    switch(n) {
        /* Fill in code here */
    }
    return result;
}
p1 in %rdi, p2 in %rsi, action in %edx
.L8:                        MODE_E
  movl   $27, %eax
  ret
.L3:
  movq   (%rsi), %rax
  movq   (%rdi), %rdx
  movq   %rdx, (%rsi)
  ret
.L5:
  movq   (%rdi), %rax
  addq   (%rsi), %rax
  movq   %rax, (%rdi)
  ret
.L6:
  movq   $59, (%rdi)
  movq   (%rsi), %rax
  ret
.L7:
  movq   (%rsi), %rax
  movq   %rax, (%rdi)
  movl   $27, %eax
  ret
.L9:                        default
  movl   $12, %eax
  ret

Figure 3.52 Assembly code for Problem 3.62. This code implements the different
branches of a switch statement.


Figure 3.53 shows the disassembled machine code for the procedure.

The jump table resides in a different area of memory. We can see from
the indirect jump on line 5 that the jump table begins at address 0x4006f8.
Using the gdb debugger, we can examine the six 8-byte words of memory comprising
the jump table with the command x/6gx 0x4006f8. Gdb prints the following:

(gdb) x/6gx 0x4006f8
0x4006f8:  0x00000000004005a1  0x00000000004005c3
0x400708:  0x00000000004005a1  0x00000000004005aa
0x400718:  0x00000000004005b2  0x00000000004005bf

Fill in the body of the switch statement with C code that will have the same
behavior as the machine code.
long switch_prob(long x, long n)
x in %rdi, n in %rsi
0000000000400590 <switch_prob>:
  400590: 48 83 ee 3c             sub    $0x3c,%rsi
  400594: 48 83 fe 05             cmp    $0x5,%rsi
  400598: 77 29                   ja     4005c3 <switch_prob+0x33>
  40059a: ff 24 f5 f8 06 40 00    jmpq   *0x4006f8(,%rsi,8)
  4005a1: 48 8d 04 fd 00 00 00    lea    0x0(,%rdi,8),%rax
  4005a8: 00
  4005a9: c3                      retq
  4005aa: 48 89 f8                mov    %rdi,%rax
  4005ad: 48 c1 f8 03             sar    $0x3,%rax
  4005b1: c3                      retq
  4005b2: 48 89 f8                mov    %rdi,%rax
  4005b5: 48 c1 e0 04             shl    $0x4,%rax
  4005b9: 48 29 f8                sub    %rdi,%rax
  4005bc: 48 89 c7                mov    %rax,%rdi
  4005bf: 48 0f af ff             imul   %rdi,%rdi
  4005c3: 48 8d 47 4b             lea    0x4b(%rdi),%rax
  4005c7: c3                      retq

Figure 3.53 Disassembled code for Problem 3.63.


3.64 ◆◆◆
Consider the following source code, where R, S, and T are constants declared with
#define:

long A[R][S][T];

long store_ele(long i, long j, long k, long *dest)
{
    *dest = A[i][j][k];
    return sizeof(A);
}

In compiling this program, gcc generates the following assembly code:

long store_ele(long i, long j, long k, long *dest)
i in %rdi, j in %rsi, k in %rdx, dest in %rcx
store_ele:
  leaq   (%rsi,%rsi,2), %rax
  leaq   (%rsi,%rax,4), %rax
  movq   %rdi, %rsi
  salq   $6, %rsi
  addq   %rsi, %rdi
  addq   %rax, %rdi
  addq   %rdi, %rdx
  movq   A(,%rdx,8), %rax
  movq   %rax, (%rcx)
  movl   $3640, %eax
  ret


A. Extend Equation 3.1 from two dimensions to three to provide a formula for
the location of array element A[i][j][k].

B. Use your reverse engineering skills to determine the values of R, S, and T
based on the assembly code.


3.65 ◆
The following code transposes the elements of an M x M array, where M is a
constant defined by #define:

void transpose(long A[M][M]) {
    long i, j;
    for (i = 0; i < M; i++)
        for (j = 0; j < i; j++) {
            long t = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = t;
        }
}

When compiled with optimization level -O1, gcc generates the following code
for the inner loop of the function:

.L6:
  movq   (%rdx), %rcx
  movq   (%rax), %rsi
  movq   %rsi, (%rdx)
  movq   %rcx, (%rax)
  addq   $8, %rdx
  addq   $120, %rax
  cmpq   %rdi, %rax
  jne    .L6

We can see that gcc has converted the array indexing to pointer code.

A. Which register holds a pointer to array element A[i][j]?
B. Which register holds a pointer to array element A[j][i]?
C. What is the value of M?


3.66 ◆
Consider the following source code, where NR and NC are macro expressions
declared with #define that compute the dimensions of array A in terms of
parameter n. This code computes the sum of the elements of column j of the array.

long sum_col(long n, long A[NR(n)][NC(n)], long j) {
    long i;
    long result = 0;
    for (i = 0; i < NR(n); i++)
        result += A[i][j];
    return result;
}

In compiling this program, gcc generates the following assembly code:

long sum_col(long n, long A[NR(n)][NC(n)], long j)
n in %rdi, A in %rsi, j in %rdx
sum_col:
  leaq   1(,%rdi,4), %r8
  leaq   (%rdi,%rdi,2), %rax
  movq   %rax, %rdi
  testq  %rax, %rax
  jle    .L4
  salq   $3, %r8
  leaq   (%rsi,%rdx,8), %rcx
  movl   $0, %eax
  movl   $0, %edx
.L3:
  addq   (%rcx), %rax
  addq   $1, %rdx
  addq   %r8, %rcx
  cmpq   %rdi, %rdx
  jne    .L3
  rep; ret
.L4:
  movl   $0, %eax
  ret

Use your reverse engineering skills to determine the definitions of NR and NC.


3.67 ◆◆
For this exercise, we will examine the code generated by gcc for functions that have
structures as arguments and return values, and from this see how these language
features are typically implemented.

The following C code has a function process having structures as argument
and return values, and a function eval that calls process:

typedef struct {
    long a[2];
    long *p;
} strA;

typedef struct {
    long u[2];
    long q;
} strB;

strB process(strA s) {
    strB r;
    r.u[0] = s.a[1];
    r.u[1] = s.a[0];
    r.q = *s.p;
    return r;
}

long eval(long x, long y, long z) {
    strA s;
    s.a[0] = x;
    s.a[1] = y;
    s.p = &z;
    strB r = process(s);
    return r.u[0] + r.u[1] + r.q;
}


Gcc generates the following code for these two functions: 


strB process(strA s)
1  process:
2    movq   %rdi, %rax
3    movq   24(%rsp), %rdx
4    movq   (%rdx), %rdx
5    movq   16(%rsp), %rcx
6    movq   %rcx, (%rdi)
7    movq   8(%rsp), %rcx
8    movq   %rcx, 8(%rdi)
9    movq   %rdx, 16(%rdi)
10   ret

long eval(long x, long y, long z)
x in %rdi, y in %rsi, z in %rdx
1  eval:
2    subq   $104, %rsp
3    movq   %rdx, 24(%rsp)
4    leaq   24(%rsp), %rax
5    movq   %rdi, (%rsp)
6    movq   %rsi, 8(%rsp)
7    movq   %rax, 16(%rsp)
8    leaq   64(%rsp), %rdi
9    call   process
10   movq   72(%rsp), %rax
11   addq   64(%rsp), %rax
12   addq   80(%rsp), %rax
13   addq   $104, %rsp
14   ret


A. We can see on line 2 of function eval that it allocates 104 bytes on the stack.
Diagram the stack frame for eval, showing the values that it stores on the
stack prior to calling process.

B. What value does eval pass in its call to process?

C. How does the code for process access the elements of structure argument s?

D. How does the code for process set the fields of result structure r?

E. Complete your diagram of the stack frame for eval, showing how eval
accesses the elements of structure r following the return from process.

F. What general principles can you discern about how structure values are
passed as function arguments and how they are returned as function results?


3.68 ◆◆◆
In the following code, A and B are constants defined with #define:

typedef struct {
    int x[A][B]; /* Unknown constants A and B */
    long y;
} str1;

typedef struct {
    char array[B];
    int t;
    short s[A];
    long u;
} str2;

void setVal(str1 *p, str2 *q) {
    long v1 = q->t;
    long v2 = q->u;
    p->y = v1+v2;
}

Gcc generates the following code for setVal:

void setVal(str1 *p, str2 *q)
p in %rdi, q in %rsi
setVal:
  movslq  8(%rsi), %rax
  addq    32(%rsi), %rax
  movq    %rax, 184(%rdi)


What are the values of A and B? (The solution is unique.) 


3.69 ◆◆
You are charged with maintaining a large C program, and you come across the
following code:

typedef struct {
    int first;
    a_struct a[CNT];
    int last;
} b_struct;

void test(long i, b_struct *bp)
{
    int n = bp->first + bp->last;
    a_struct *ap = &bp->a[i];
    ap->x[ap->idx] = n;
}

The declarations of the compile-time constant CNT and the structure a_struct
are in a file for which you do not have the necessary access privilege. Fortunately,
you have a copy of the .o version of code, which you are able to disassemble with
the objdump program, yielding the following disassembly:

void test(long i, b_struct *bp)
i in %rdi, bp in %rsi
0000000000000000 <test>:
  8b 8e 20 01 00 00    mov    0x120(%rsi),%ecx
  03 0e                add    (%rsi),%ecx
  48 8d 04 bf          lea    (%rdi,%rdi,4),%rax
  48 8d 04 c6          lea    (%rsi,%rax,8),%rax
  48 8b 50 08          mov    0x8(%rax),%rdx
  48 63 c9             movslq %ecx,%rcx
  48 89 4c d0 10       mov    %rcx,0x10(%rax,%rdx,8)
  c3                   retq

Using your reverse engineering skills, deduce the following:

A. The value of CNT.

B. A complete declaration of structure a_struct. Assume that the only fields
in this structure are idx and x, and that both of these contain signed values.
3.70 ◆◆
Consider the following union declaration:

union ele {
    struct {
        long *p;
        long y;
    } e1;
    struct {
        long x;
        union ele *next;
    } e2;
};

This declaration illustrates that structures can be embedded within unions.
The following function (with some expressions omitted) operates on a linked
list having these unions as list elements:

void proc (union ele *up) {
    up->______ = *(______) - ______;
}

A. What are the offsets (in bytes) of the following fields:

e1.p
e1.y
e2.x
e2.next

B. How many total bytes does the structure require?

C. The compiler generates the following assembly code for proc:

void proc (union ele *up)
up in %rdi
proc:
  movq   8(%rdi), %rax
  movq   (%rax), %rdx
  movq   (%rdx), %rdx
  subq   8(%rax), %rdx
  movq   %rdx, (%rdi)
  ret

On the basis of this information, fill in the missing expressions in the code
for proc. Hint: Some union references can have ambiguous interpretations.
These ambiguities get resolved as you see where the references lead. There
is only one answer that does not perform any casting and does not violate
any type constraints.


3.71 ◆
Write a function good_echo that reads a line from standard input and writes it to
standard output. Your implementation should work for an input line of arbitrary
length. You may use the library function fgets, but you must make sure your
function works correctly even when the input line requires more space than you
have allocated for your buffer. Your code should also check for error conditions
and return when one is encountered. Refer to the definitions of the standard I/O
functions for documentation [45, 61].


3.72 ◆◆
Figure 3.54(a) shows the code for a function that is similar to function vframe
(Figure 3.43(a)). We used vframe to illustrate the use of a frame pointer in
managing variable-size stack frames. The new function aframe allocates space for local
array p by calling library function alloca. This function is similar to the more
commonly used function malloc, except that it allocates space on the run-time stack.
The space is automatically deallocated when the executing procedure returns.

(a) C code

#include <alloca.h>

long aframe(long n, long idx, long *q) {
    long i;
    long **p = alloca(n * sizeof(long *));
    p[0] = &i;
    for (i = 1; i < n; i++)
        p[i] = q;
    return *p[idx];
}

(b) Portions of generated assembly code

long aframe(long n, long idx, long *q)
n in %rdi, idx in %rsi, q in %rdx
1  aframe:
2    pushq  %rbp
3    movq   %rsp, %rbp
4    subq   $16, %rsp               Allocate space for i (%rsp = s1)
5    leaq   30(,%rdi,8), %rax
6    andq   $-16, %rax
7    subq   %rax, %rsp              Allocate space for array p (%rsp = s2)
8    leaq   15(%rsp), %r8
9    andq   $-16, %r8               Set %r8 to &p[0]

Figure 3.54 Code for Problem 3.72. This function is similar to that of Figure 3.43.
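
For readers who have not used alloca before, the following is a minimal sketch of the allocation pattern just described. The function sum_squares is a hypothetical example, and it assumes n is small enough that the scratch array fits comfortably on the stack.

```c
#include <alloca.h>

/* Hypothetical example: sum of i*i for 0 <= i < n, using a scratch
   array allocated on the run-time stack. */
long sum_squares(long n)
{
    /* The space belongs to this function's stack frame and is released
       automatically on return; it must not be passed to free(). */
    long *sq = alloca(n * sizeof(long));
    long i, total = 0;

    for (i = 0; i < n; i++)
        sq[i] = i * i;
    for (i = 0; i < n; i++)
        total += sq[i];
    return total;
}
```

Replacing alloca with malloc would require a matching call to free before the return.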

Figure 3.54(b) shows the part of the assembly code that sets up the frame
pointer and allocates space for local variables i and p. It is very similar to the
corresponding code for vframe. Let us use the same notation as in Problem 3.49:
The stack pointer is set to values $s_1$ at line 4 and $s_2$ at line 7. The start address of
array p is set to value p at line 9. Extra space $e_2$ may arise between $s_2$ and p, and
extra space $e_1$ may arise between the end of array p and $s_1$.

A. Explain, in mathematical terms, the logic in the computation of $s_2$.

B. Explain, in mathematical terms, the logic in the computation of p.

C. Find values of n and $s_1$ that lead to minimum and maximum values of $e_1$.

D. What alignment properties does this code guarantee for the values of $s_2$
and p?


3.73 ◆
Write a function in assembly code that matches the behavior of the function
find_range in Figure 3.51. Your code should contain only one floating-point comparison
instruction, and then it should use conditional branches to generate the correct
result. Test your code on all $2^{32}$ possible argument values. Web Aside ASM:EASM
on page 178 describes how to incorporate functions written in assembly code into
C programs.


3.74 ◆◆
Write a function in assembly code that matches the behavior of the function
find_range in Figure 3.51. Your code should contain only one floating-point comparison
instruction, and then it should use conditional moves to generate the correct result.
You might want to make use of the instruction cmovp (move if even parity). Test
your code on all $2^{32}$ possible argument values. Web Aside ASM:EASM on page 178
describes how to incorporate functions written in assembly code into C programs.


3.75 ◆
ISO C99 includes extensions to support complex numbers. Any floating-point type
can be modified with the keyword complex. Here are some sample functions that
work with complex data and that call some of the associated library functions:

#include <complex.h>

double c_imag(double complex x) {
    return cimag(x);
}

double c_real(double complex x) {
    return creal(x);
}

double complex c_sub(double complex x, double complex y) {
    return x - y;
}

When compiled, gcc generates the following assembly code for these functions:

double c_imag(double complex x)
c_imag:
  movapd  %xmm1, %xmm0
  ret

double c_real(double complex x)
c_real:
  rep; ret

double complex c_sub(double complex x, double complex y)
c_sub:
  subsd   %xmm2, %xmm0
  subsd   %xmm3, %xmm1
  ret

Based on these examples, determine the following:

A. How are complex arguments passed to a function?
B. How are complex values returned from a function?

Solutions to Practice Problems 


Solution to Problem 3.1 (page 182) 
This exercise gives you practice with the different operand forms.

Operand            Value    Comment
%rax               0x100    Register
0x104              0xAB     Absolute address
$0x108             0x108    Immediate
(%rax)             0xFF     Address 0x100
4(%rax)            0xAB     Address 0x104
9(%rax,%rdx)       0x11     Address 0x10C
260(%rcx,%rdx)     0x13     Address 0x108
0xFC(,%rcx,4)      0xFF     Address 0x100
(%rax,%rdx,4)      0x11     Address 0x10C


Solution to Problem 3.2 (page 185)
As we have seen, the assembly code generated by gcc includes suffixes on the
instructions, while the disassembler does not. Being able to switch between these
two forms is an important skill to learn. One important feature is that memory
references in x86-64 are always given with quad word registers, such as %rax, even
if the operand is a byte, single word, or double word.

Here is the code written with suffixes:

movl  %eax, (%rsp)
movw  (%rax), %dx
movb  $0xFF, %bl
movb  (%rsp,%rdx,4), %dl
movq  (%rdx), %rax
movw  %dx, (%rax)


Solution to Problem 3.3 (page 186)
Since we will rely on gcc to generate most of our assembly code, being able to
write correct assembly code is not a critical skill. Nonetheless, this exercise will
help you become more familiar with the different instruction and operand types.

Here is the code with explanations of the errors:

movb  $0xF, (%ebx)       Cannot use %ebx as address register
movl  %rax, (%rsp)       Mismatch between instruction suffix and register ID
movw  (%rax),4(%rsp)     Cannot have both source and destination be memory references
movb  %al,%sl            No register named %sl
movl  %eax,$0x123        Cannot have immediate as destination
movl  %eax,%dx           Destination operand incorrect size
movb  %si, 8(%rbp)       Mismatch between instruction suffix and register ID


Solution to Problem 3.4 (page 187) 

This exercise gives you more experience with the different data movement
instructions and how they relate to the data types and conversion rules of C. The
nuances of conversions of both signedness and size, as well as integral promotion,
add challenge to this problem.

src_t           dest_t          Instruction              Comments
long            long            movq (%rdi), %rax        Read 8 bytes
                                movq %rax, (%rsi)        Store 8 bytes
char            int             movsbl (%rdi), %eax      Convert char to int
                                movl %eax, (%rsi)        Store 4 bytes
char            unsigned        movsbl (%rdi), %eax      Convert char to int
                                movl %eax, (%rsi)        Store 4 bytes
unsigned char   long            movzbl (%rdi), %eax      Read byte and zero-extend
                                movq %rax, (%rsi)        Store 8 bytes
int             char            movl (%rdi), %eax        Read 4 bytes
                                movb %al, (%rsi)         Store low-order byte
unsigned        unsigned char   movl (%rdi), %eax        Read 4 bytes
                                movb %al, (%rsi)         Store low-order byte
char            short           movsbw (%rdi), %ax       Read byte and sign-extend
                                movw %ax, (%rsi)         Store 2 bytes
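
Three rows of the table can be reproduced in C to see the conversion rules at work. In this sketch the helper names are hypothetical, signed char is written explicitly because the signedness of plain char is implementation-defined, and unsigned is assumed to be 32 bits wide as it is on x86-64.

```c
/* char -> unsigned: the value is sign-extended first (movsbl), then the
   32-bit pattern is reinterpreted as unsigned. */
unsigned char_to_unsigned(signed char c)
{
    return (unsigned) c;
}

/* unsigned char -> long: the byte is zero-extended (movzbl), so the
   result is never negative. */
long uchar_to_long(unsigned char c)
{
    return (long) c;
}

/* int -> byte: only the low-order byte survives (movb %al). */
unsigned char int_low_byte(int x)
{
    return (unsigned char) x;
}
```

The first function is the surprising one: a char holding -1 becomes 0xFFFFFFFF rather than 0xFF.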


Solution to Problem 3.5 (page 189) 

Reverse engineering is a good way to understand systems. In this case, we want
to reverse the effect of the C compiler to determine what C code gave rise to this
assembly code. The best way is to run a “simulation,” starting with values x, y, and
z at the locations designated by pointers xp, yp, and zp, respectively. We would
then get the following behavior:

void decode1(long *xp, long *yp, long *zp)
xp in %rdi, yp in %rsi, zp in %rdx
decode1:
  movq   (%rdi), %r8       Get x = *xp
  movq   (%rsi), %rcx      Get y = *yp
  movq   (%rdx), %rax      Get z = *zp
  movq   %r8, (%rsi)       Store x at yp
  movq   %rcx, (%rdx)      Store y at zp
  movq   %rax, (%rdi)      Store z at xp
  ret

From this, we can generate the following C code:

void decode1(long *xp, long *yp, long *zp)
{
    long x = *xp;
    long y = *yp;
    long z = *zp;
    *yp = x;
    *zp = y;
    *xp = z;
}

Solution to Problem 3.6 (page 192) 

This exercise demonstrates the versatility of the leaq instruction and gives you
more practice in deciphering the different operand forms. Although the operand
forms are classified as type “Memory” in Figure 3.3, no memory access occurs.

Instruction                     Result
leaq 6(%rax), %rdx              6 + x
leaq (%rax,%rcx), %rdx          x + y
leaq (%rax,%rcx,4), %rdx        x + 4y
leaq 7(%rax,%rax,8), %rdx       7 + 9x
leaq 0xA(,%rcx,4), %rdx         10 + 4y
leaq 9(%rax,%rcx,2), %rdx       9 + x + 2y
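
Each row of the table corresponds to an ordinary C arithmetic expression of the form displacement + base + index * scale. The sketch below writes out three of them, with x standing for the value in %rax and y for the value in %rcx; the function names are hypothetical.

```c
/* 7(%rax,%rax,8): displacement 7, base x, index x scaled by 8. */
long lea_7_plus_9x(long x)
{
    return 7 + x + 8 * x;
}

/* 0xA(,%rcx,4): displacement 10, no base, index y scaled by 4. */
long lea_10_plus_4y(long y)
{
    return 0xA + 4 * y;
}

/* 9(%rax,%rcx,2): displacement 9, base x, index y scaled by 2. */
long lea_9_x_2y(long x, long y)
{
    return 9 + x + 2 * y;
}
```

A compiler is free to implement such expressions with a single leaq, which is why the instruction appears so often in code that performs no memory access.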


Solution to Problem 3.7 (page 193) ! 

Again, reverse engineering proves to be a useful way to learn the relationship 

between C code and the genergted assembly code. t 
The best way to solve,problems of this type is to annotate the lines of assembly 


D 


code with informatio about the operations being performed. Here is a sample: 


long scale2(long x, long y, long z)
x in %rdi, y in %rsi, z in %rdx
scale2:
  leaq (%rdi,%rdi,4), %rax     5 * x
  leaq (%rax,%rsi,2), %rax     5 * x + 2 * y
  leaq (%rax,%rdx,8), %rax     5 * x + 2 * y + 8 * z
  ret


From this, it is easy to generate the missing expression:
long t = 5 * x + 2 * y + 8 * z;
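
As a cross-check on the annotations for scale2, each leaq instruction can be mirrored by one C statement and compared against the single recovered expression. This is only a sketch; the name scale2_steps is hypothetical and does not appear in the original problem.

```c
/* Sketch: mirror each leaq of scale2 with one C statement.
 * scale2_steps is a hypothetical name used only for illustration. */
long scale2_steps(long x, long y, long z)
{
    long t = x + 4 * x;   /* first leaq:  5*x */
    t = t + 2 * y;        /* second leaq: 5*x + 2*y */
    t = t + 8 * z;        /* third leaq:  5*x + 2*y + 8*z */
    return t;
}

/* The single expression recovered in the solution. */
long scale2(long x, long y, long z)
{
    long t = 5 * x + 2 * y + 8 * z;
    return t;
}
```

Both functions should agree for any inputs that do not overflow, which is a quick way to confirm that the annotations were read correctly.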


Solution to Problem 3.8 (page 194)

This problem gives you a chance to test your understanding of operands and the
arithmetic instructions. The instruction sequence is designed so that the result of
each instruction does not affect the behavior of subsequent ones.





Instruction                  Destination   Value

addq %rcx, (%rax)            0x100         0x100
subq %rdx, 8(%rax)           0x108         0xA8
imulq $16, (%rax,%rdx,8)     0x118         0x110
incq 16(%rax)                0x110         0x14
decq %rcx                    %rcx          0x0
subq %rdx, %rax              %rax          0xFD


Solution to Problem 3.9 (page 195) 

This exercise gives you a chance to generate a little bit of assembly code. The
solution code was generated by GCC. By loading parameter n in register %ecx, it
can then use byte register %cl to specify the shift amount for the sarq instruction.
It might seem odd to use a movl instruction, given that n is eight bytes long, but
keep in mind that only the least significant byte is required to specify the shift
amount.
long shift_left4_rightn(long x, long n)
x in %rdi, n in %rsi

shift_left4_rightn:
  movq %rdi, %rax      Get x
  salq $4, %rax        x <<= 4
  movl %esi, %ecx      Get n (4 bytes)
  sarq %cl, %rax       x >>= n


Solution to Problem 3.10 (page 196) 
This problem is fairly straightforward, since the assembly code follows the struc-
ture of the C code closely.


    long t1 = x | y;
    long t2 = t1 >> 3;
    long t3 = ~t2;
    long t4 = z-t3;


Solution to Problem 3.11 (page 197) 


A. This instruction is used to set register %rdx to zero, exploiting the property 
that x ^ x = 0 for any x. It corresponds to the C statement x = 0. 


B. A more direct way of setting register %rdx to zero is with the instruction movq
$0,%rdx.


C. Assembling and disassembling this code, however, we find that the version
with xorq requires only 3 bytes, while the version with movq requires 7. Other
ways to set %rdx to zero rely on the property that any instruction that updates
the lower 4 bytes will cause the high-order bytes to be set to zero. Thus, we
could use either xorl %edx, %edx (2 bytes) or movl $0, %edx (5 bytes).


Solution to Problem 3.12 (page 200) 

We can simply replace the cqto instruction with one that sets register %rdx to
zero, and use divq rather than idivq as our division instruction, yielding the
following code:


void uremdiv(unsigned long x, unsigned long y,
             unsigned long *qp, unsigned long *rp)
x in %rdi, y in %rsi, qp in %rdx, rp in %rcx

uremdiv:
  movq %rdx, %r8       Copy qp
  movq %rdi, %rax      Move x to lower 8 bytes of dividend
  movl $0, %edx        Set upper 8 bytes of dividend to 0
  divq %rsi            Divide by y
  movq %rax, (%r8)     Store quotient at qp
  movq %rdx, (%rcx)    Store remainder at rp
  ret
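
For reference, the C function that this assembly code implements can be sketched as follows. The body is our reading of the annotated listing, and the helper uremdiv_consistent is a hypothetical addition that checks the usual identity x = q*y + r with r < y.

```c
/* Sketch of the C function behind the uremdiv listing. With unsigned
 * operands, a single divq yields both the quotient and the remainder. */
void uremdiv(unsigned long x, unsigned long y,
             unsigned long *qp, unsigned long *rp)
{
    unsigned long q = x / y;   /* caller must ensure y != 0 */
    unsigned long r = x % y;
    *qp = q;
    *rp = r;
}

/* Hypothetical helper: returns 1 when x == q*y + r and r < y. */
int uremdiv_consistent(unsigned long x, unsigned long y)
{
    unsigned long q, r;
    uremdiv(x, y, &q, &r);
    return r < y && q * y + r == x;
}
```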










Solution to Problem 3.13 (page 204) 
It is important to understand that assembly code does not keep track of the type
of a program value. Instead, the different instructions determine the operand
sizes and whether they are signed or unsigned. When mapping from instruction
sequences back to C code, we must do a bit of detective work to infer the data
types of the program values.

A. The suffix ‘l’ and the register identifiers indicate 32-bit operands, while the
comparison is for a two’s-complement <. We can infer that data_t must be
int.

B. The suffix ‘w’ and the register identifiers indicate 16-bit operands, while the
comparison is for a two’s-complement >=. We can infer that data_t must be
short.

C. The suffix ‘b’ and the register identifiers indicate 8-bit operands, while
the comparison is for an unsigned <=. We can infer that data_t must be
unsigned char.

D. The suffix ‘q’ and the register identifiers indicate 64-bit operands, while
the comparison is for !=, which is the same whether the arguments are
signed, unsigned, or pointers. We can infer that data_t could be either long,
unsigned long, or some form of pointer.


Solution to Problem 3.14 (page 205)
This problem is similar to Problem 3.13, except that it involves TEST instructions
rather than CMP instructions.

A. The suffix ‘q’ and the register identifiers indicate a 64-bit operand, while the
comparison is for >=, which must be signed. We can infer that data_t must
be long.

B. The suffix ‘w’ and the register identifier indicate a 16-bit operand, while the
comparison is for ==, which is the same for signed or unsigned. We can infer
that data_t must be either short or unsigned short.

C. The suffix ‘b’ and the register identifier indicate an 8-bit operand, while the
comparison is for unsigned >. We can infer that data_t must be unsigned
char.

D. The suffix ‘l’ and the register identifier indicate 32-bit operands, while the
comparison is for <. We can infer that data_t must be int.


Solution to Problem 3.15 (page 209) 
This exercise requires you to examine disassembled code in detail and reason
about the encodings for jump targets. It also gives you practice in hexadecimal
arithmetic.

A. The je instruction has as its target 0x4003fc + 0x02. As the original disas- 
sembled code shows, this is 0x4003fe: 





   4003fa: 74 02          je     4003fe
   4003fc: ff d0          callq  *%rax

B. The je instruction has as its target 0x400431 - 12 (since 0xf4 is the 1-
byte two's-complement representation of -12). As the original disassembled
code shows, this is 0x400425:


   40042f: 74 f4          je     400425
   400431: 5d             pop    %rbp


C. According to the annotation produced by the disassembler, the jump target 
is at absolute address 0x400547. According to the byte encoding, this must 
be at an address 0x2 bytes beyond that of the pop instruction. Subtracting 
these gives address 0x400545. Noting that the encoding of the ja instruction 
requires 2 bytes, it must be located at address 0x400543. These are confirmed 
by examining the original disassembly: 


   400543: 77 02          ja     400547
   400545: 5d             pop    %rbp






D. Reading the bytes in reverse order, we can see that the target offset is
0xffffff73, or decimal -141. Adding this to 0x4005ed (the address of
the nop instruction) gives address 0x400560:







4005e8: e9 73 ff ff ff jmpq 400560 
4005ed: 90 nop 
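
The arithmetic used in all four parts follows one rule: the target is the address of the instruction after the jump plus the sign-extended displacement. A minimal sketch of that rule in C is shown below; the function names rel8, rel32, and jump_target are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Sign-extend a 1-byte displacement, e.g. 0xf4 -> -12. */
int64_t rel8(uint8_t b)
{
    return (b & 0x80) ? (int64_t)b - 0x100 : (int64_t)b;
}

/* Assemble a 4-byte little-endian displacement and sign-extend it. */
int64_t rel32(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    uint32_t v = (uint32_t)b0 | ((uint32_t)b1 << 8) |
                 ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
    return (v & 0x80000000u) ? (int64_t)v - 0x100000000LL : (int64_t)v;
}

/* Target = address of the instruction following the jump + displacement. */
uint64_t jump_target(uint64_t next_insn_addr, int64_t disp)
{
    return next_insn_addr + (uint64_t)disp;
}
```

For part D, for example, the bytes 73 ff ff ff are passed to rel32 in the order they appear in the encoding, and the result is added to 0x4005ed.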






Solution to Problem 3.16 (page 212) 

Annotating assembly code and writing C code that mimics its control flow are good 
first steps in understanding assembly-language programs. This problem gives you 
practice for an example with simple control flow. It also gives you a chance to 
examine the implementation of logical operations. 


A. Here is the C code:


void goto_cond(long a, long *p) {
    if (p == 0)
        goto done;
    if (*p >= a)
        goto done;
    *p = a;
 done:
    return;
}





B. The first conditional branch is part of the implementation of the && expres- 
sion. If the test for p being non-null fails, the code will skip the test of a > *p. 







Solution to Problem 3.17 (page 212) 


This is an exercise to help you think about the idea of a general translation rule 
and how to apply it. 







A. Converting to this alternate form involves only switching around a few lines 
of the code: 
long gotodiff_se_alt(long x, long y) {
    long result;
    if (x < y)
        goto x_lt_y;
    ge_cnt++;
    result = x - y;
    return result;
 x_lt_y:
    lt_cnt++;
    result = y - x;
    return result;
}


B. In most respects, the choice is arbitrary. But the original rule works better
for the common case where there is no else statement. For this case, we can
simply modify the translation rule to be as follows:


    t = test-expr;
    if (!t)
        goto done;
    then-statement
 done:


A translation based on the alternate rule is more cumbersome.


Solution to Problem 3.18 (page 213)
This problem requires that you work through a nested branch structure, where
you will see how our rule for translating if statements has been applied. On the
whole, the machine code is a straightforward translation of the C code.

long test(long x, long y, long z) {
    long val = x+y+z;
    if (x < -3) {
        if (y < z)
            val = x*y;
        else
            val = y*z;
    } else if (x > 2)
        val = x*z;
    return val;
}


Solution to Problem 3.19 (page 216) 
This problem reinforces our method of computing the misprediction penalty.


A. We can apply our formula directly to get T_MP = 2(31 - 16) = 30.

B. When misprediction occurs, the function will require around 16 + 30 = 46
cycles.


Solution to Problem 3.20 (page 219) 
This problem provides a chance to study the use of conditional moves. 


A. The operator is ‘/’. We see this is an example of dividing by a power of 2 by
right shifting (see Section 2.3.7). Before shifting by k = 3, we must add a bias
of 2^k - 1 = 7 when the dividend is negative.


B. Here is an annotated version of the assembly code:


long arith(long x)
x in %rdi
arith:
  leaq 7(%rdi), %rax      temp = x+7
  testq %rdi, %rdi        Test x
  cmovns %rdi, %rax       If x >= 0, temp = x
  sarq $3, %rax           result = temp >> 3 (= x/8)
  ret


C. The program creates a temporary value equal to x + 7, in anticipation of x
being negative and therefore requiring biasing. The cmovns instruction con-
ditionally changes this number to x when x >= 0, and then it is shifted by 3 to
generate x/8.
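
The effect of the bias can be checked with a short C sketch that follows the same steps as the arith listing. It assumes that >> on a negative long is an arithmetic shift, which holds for GCC on x86-64 but is implementation-defined in standard C; the conditional expression stands in for the conditional move.

```c
/* Sketch: divide by 8 with a right shift, adding a bias of 2^3 - 1 = 7
 * for negative x so that the result rounds toward zero, as x/8 does.
 * Assumes an arithmetic right shift for negative values. */
long arith(long x)
{
    long temp = (x < 0) ? x + 7 : x;   /* biased only when x is negative */
    return temp >> 3;                  /* shift right by k = 3 */
}
```

Without the bias, -9 >> 3 would give -2, whereas -9/8 is -1 in C.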


Solution to Problem 3.21 (page 219) 

This problem is similar to Problem 3.18, except that some of the conditionals have
been implemented by conditional data transfers. Although it might seem daunting
to fit this code into the framework of the original C code, you will find that it follows
the translation rules fairly closely.


long test(long x, long y) {
    long val = 8*x;
    if (y > 0) {
        if (x < y)
            val = y-x;
        else
            val = x&y;
    } else if (y <= -2)
        val = x+y;
    return val;
}


Solution to Problem 3.22 (page 221)


A. If we build up a table of factorials computed with data type int, we get the
following:


n     n!                OK?

1     1                 Y
2     2                 Y
3     6                 Y
4     24                Y
5     120               Y
6     720               Y
7     5,040             Y
8     40,320            Y
9     362,880           Y
10    3,628,800         Y
11    39,916,800        Y
12    479,001,600       Y
13    1,932,053,504     N


We can see that the computation of 13! has overflowed. As we learned in
Problem 2.35, when we get value x while attempting to compute n!, we can
test for overflow by computing x/n and seeing whether it equals (n - 1)!
(assuming that we have already ensured that the computation of (n - 1)! did
not overflow). In this case we get 1,932,053,504/13 = 148,619,500.308, which
does not equal 12! = 479,001,600. As a
second test, we can see that any factorial beyond 10! must be a multiple of
100 and therefore have zeros for the last two digits. The correct value of 13!
is 6,227,020,800.


B. Doing the computation with data type long lets us go up to 20!, yielding
2,432,902,008,176,640,000.
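
The division test for overflow can be sketched in C as follows. To avoid relying on signed overflow, which is undefined in C, the sketch emulates 32-bit two's-complement multiplication with unsigned arithmetic; the names mul32_wrap and largest_exact_factorial32 are hypothetical.

```c
#include <stdint.h>

/* Multiply two 32-bit values the way 32-bit two's-complement hardware
 * would: keep only the low 32 bits of the product. */
int32_t mul32_wrap(int32_t a, int32_t b)
{
    uint32_t p = (uint32_t)a * (uint32_t)b;
    int64_t w = (int64_t)p;
    if (w >= 0x80000000LL)
        w -= 0x100000000LL;        /* reinterpret as a signed value */
    return (int32_t)w;
}

/* Largest n whose factorial passes the test x/n == (n-1)!
 * when computed with 32-bit arithmetic. */
int largest_exact_factorial32(void)
{
    int32_t prev = 1;                      /* 1! */
    for (int32_t n = 2; ; n++) {
        int32_t x = mul32_wrap(prev, n);   /* candidate value for n! */
        if (x / n != prev)
            return n - 1;                  /* n! overflowed */
        prev = x;
    }
}
```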


Solution to Problem 3.23 (page 222) 

The code generated when compiling loops can be tricky to analyze, because the 
compiler can perform many different optimizations on loop code, and because it 
can be difficult to match program variables with registers. This particular example 
demonstrates several places where the assembly code is not just a direct translation 
of the C code. 


A. Although parameter x is passed to the function in register %rdi, we can see
that the register is never referenced once the loop is entered. Instead, we
can see that registers %rax, %rcx, and %rdx are initialized in lines 2-5 to x,
x*x, and x+x. We can conclude, therefore, that these registers contain the
program variables.


B. The compiler determines that pointer p always points to x, and hence the
expression (*p)++ simply increments x. It combines this incrementing by 1
with the increment by y, via the leaq instruction of line 7.


C. The annotated code is as follows:
long dw_loop(long x)
x initially in %rdi

1   dw_loop:
2     movq %rdi, %rax             Copy x to %rax
3     movq %rdi, %rcx
4     imulq %rdi, %rcx            Compute y = x*x
5     leaq (%rdi,%rdi), %rdx      Compute n = 2*x
6   .L2:                        loop:
7     leaq 1(%rcx,%rax), %rax     Compute x += y + 1
8     subq $1, %rdx               Decrement n
9     testq %rdx, %rdx            Test n
10    jg .L2                      If > 0, goto loop
11    rep; ret                    Return





Solution to Problem 3.24 (page 224) 


This assembly code is a fairly straightforward translation of the loop using the 
jump-to-middle method. The full C code is as follows: 






long loop_while(long a, long b)
{
    long result = 1;
    while (a < b) {
        result = result * (a+b);
        a = a+1;
    }
    return result;
}







Solution to Problem 3.25 (page 226) 
While the generated code does not follow the exact pattern of the guarded-do 
translation, we can see that it is equivalent to the following C code: 







long loop_while2(long a, long b)
{
    long result = b;
    while (b > 0) {
        result = result * a;
        b = b-a;
    }
    return result;
}







We will often see cases, especially when compiling with higher levels of opti-
mization, where GCC takes some liberties in the exact form of the code it generates,
while preserving the required functionality.
Solution to Problem 3.26 (page 228)
Being able to work backward from assembly code to C code is a prime example 
of reverse engineering. 
A. We can see that the code uses the jump-to-middle translation, using the jmp 
instruction on line 3. 
B. Here is the original C code: 












long fun_a(unsigned long x) {
    long val = 0;
    while (x) {
        val ^= x;
        x >>= 1;
    }
    return val & 0x1;
}


C. This code computes the parity of argument x. That is, it returns 1 if there is 
an odd number of ones in x and 0 if there is an even number. 
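
One way to convince ourselves that fun_a computes parity is to compare it against a direct count of the one bits. The sketch below repeats fun_a and adds a reference function, parity_by_counting, which is a hypothetical name introduced only for this comparison.

```c
/* fun_a from the solution: XORs together all right-shifted copies of x,
 * so bit 0 of val ends up as the XOR of every bit of x. */
long fun_a(unsigned long x)
{
    long val = 0;
    while (x) {
        val ^= x;
        x >>= 1;
    }
    return val & 0x1;
}

/* Hypothetical reference: count the one bits and return the count mod 2. */
long parity_by_counting(unsigned long x)
{
    long ones = 0;
    for (; x != 0; x >>= 1)
        ones += (long)(x & 1UL);
    return ones % 2;
}
```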









Solution to Problem 3.27 (page 231) 
This exercise is intended to reinforce your understanding of how loops are imple-
mented.


long fact_for_gd_goto(long n)
{
    long i = 2;
    long result = 1;
    if (n <= 1)
        goto done;
 loop:
    result *= i;
    i++;
    if (i <= n)
        goto loop;
 done:
    return result;
}


Solution to Problem 3.28 (page 231) 
This problem is trickier than Problem 3.26, since the code within the loop is more
complex and the overall operation is less familiar.


A. Here is the original C code:


long fun_b(unsigned long x) {
    long val = 0;
    long i;
    for (i = 64; i != 0; i--) {
        val = (val << 1) | (x & 0x1);
        x >>= 1;
    }
    return val;
}


B. The code was generated using the guarded-do transformation, but the com-
piler detected that, since i is initialized to 64, it will satisfy the test i != 0, and
therefore the initial test is not required.


C. This code reverses the bits in x, creating a mirror image. It does this by
shifting the bits of x from left to right, and then filling these bits in as it
shifts val from right to left.
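
The mirror-image behavior is easy to test if we restate the loop with an unsigned accumulator, which avoids left-shifting a signed value. The following is a sketch of the same technique under that change; reverse_bits64 is a hypothetical name.

```c
#include <stdint.h>

/* Sketch of the bit-reversal loop with an unsigned accumulator. */
uint64_t reverse_bits64(uint64_t x)
{
    uint64_t val = 0;
    for (int i = 64; i != 0; i--) {
        val = (val << 1) | (x & 0x1);   /* append the current low bit of x */
        x >>= 1;                        /* expose the next bit of x */
    }
    return val;
}
```

Reversing twice should return the original value, which gives a simple sanity check.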


Solution to Problem 3.29 (page 232)
Our stated rule for translating a for loop into a while loop is just a bit too
simplistic; this is the only aspect that requires special consideration.

A. Applying our translation rule would yield the following code:

/* Naive translation of for loop into while loop */
/* WARNING: This is buggy code */
long sum = 0;
long i = 0;
while (i < 10) {
    if (i & 1)
        /* This will cause an infinite loop */
        continue;
    sum += i;
    i++;
}


This code has an infinite loop, since the continue statement would prevent
index variable i from being updated.


B. The general solution is to replace the continue statement with a goto
statement that skips the rest of the loop body and goes directly to the update
portion:


/* Correct translation of for loop into while loop */
long sum = 0;
long i = 0;
while (i < 10) {
    if (i & 1)
        goto update;
    sum += i;
 update:
    i++;
}


Solution to Problem 3.30 (page 236) 

This problem gives you a chance to reason about the control flow of a switch 
statement. Answering the questions requires you to combine information from 
several places in the assembly code. 


* Line 2 of the assembly code adds 1 to x to set the lower range of the cases to
  zero. That means that the minimum case label is -1.

* Lines 3 and 4 cause the program to jump to the default case when the adjusted
  case value is greater than 8. This implies that the maximum case label is
  -1 + 8 = 7.

* In the jump table, we see that the entries on lines 6 (case value 3) and 9 (case
  value 6) have the same destination (.L2) as the jump instruction on line 4,
  indicating the default case behavior. Thus, case labels 3 and 6 are missing in
  the switch statement body.

* In the jump table, we see that the entries on lines 3 and 10 have the same
  destination. These correspond to cases 0 and 7.

* In the jump table, we see that the entries on lines 5 and 7 have the same
  destination. These correspond to cases 2 and 4.


From this reasoning, we draw the following conclusions:


A. The case labels in the switch statement body have values -1, 0, 1, 2, 4, 5,
and 7.
B. The case with destination .L5 has labels 0 and 7.
C. The case with destination .L7 has labels 2 and 4.


Solution to Problem 3.31 (page 237)


The key to reverse engineering compiled switch statements is to combine the
information from the assembly code and the jump table to sort out the different
cases. We can see from the ja instruction (line 3) that the code for the default case
has label .L2. We can see that the only other repeated label in the jump table is
.L5, and so this must be the code for the cases C and D. We can see that the code
falls through at line 8, and so label .L7 must match case A and label .L3 must
match case B. That leaves only label .L6 to match case E.

The original C code is as follows:


void switcher(long a, long b, long c, long *dest)
{
    long val;
    switch(a) {
    case 5:
        c = b ^ 15;
        /* Fall through */
    case 0:
        val = c + 112;
        break;
    case 2:
    case 7:
        val = (c + b) << 2;
        break;
    case 4:
        val = a;
        break;
    default:
        val = b;
    }
    *dest = val;
}


Solution to Problem 3.32 (page 244)

Tracing through the program execution at this level of detail reinforces many
aspects of procedure call and return. We can see clearly how control is passed to
the function when it is called, and how the calling function resumes upon return.
We can also see how arguments get passed through registers %rdi and %rsi, and
how results are returned via register %rax.








       Instruction               State values (at beginning)

Label  PC        Instruction  %rdi  %rsi  %rax  %rsp            *%rsp     Description
M1     0x400560  callq        10    -     -     0x7fffffffe820  -         Call first(10)

F1     0x400548  lea          10    -     -     0x7fffffffe818  0x400565  Entry of first
F2     0x40054c  sub          10    11    -     0x7fffffffe818  0x400565
F3     0x400550  callq        9     11    -     0x7fffffffe818  0x400565  Call last(9, 11)
L1     0x400540  mov          9     11    -     0x7fffffffe810  0x400555  Entry of last
L2     0x400543  imul         9     11    9     0x7fffffffe810  0x400555
L3     0x400547  retq         9     11    99    0x7fffffffe810  0x400555  Return 99 from last
F4     0x400555  repz retq    9     11    99    0x7fffffffe818  0x400565  Return 99 from first
M2     0x400565  mov          9     11    99    0x7fffffffe820  -         Resume main


Solution to Problem 3.33 (page 246)
This problem is a bit tricky due to the mixing of different data sizes.

Let us first describe one answer and then explain the second possibility. If
we assume the first addition (line 3) implements *u += a, while the second (line 4)
implements *v += b, then we can see that a was passed as the first argument in %edi
and converted from 4 bytes to 8 before adding it to the 8 bytes pointed to by %rdx.
This implies that a must be of type int and u must be of type long *. We can also
see that the low-order byte of argument b is added to the byte pointed to by %rcx.
This implies that v must be of type char *, but the type of b is ambiguous: it could
be 1, 2, 4, or 8 bytes long. This ambiguity is resolved by noting the return value of
6, computed as the sum of the sizes of a and b. Since we know a is 4 bytes long,
we can deduce that b must be 2.
An annotated version of this function explains these details: 


int procprob(int a, short b, long *u, char *v)
a in %edi, b in %si, u in %rdx, v in %rcx


1   procprob:
2     movslq %edi, %rdi      Convert a to long
3     addq %rdi, (%rdx)      Add to *u (long)
4     addb %sil, (%rcx)      Add low-order byte of b to *v
5     movl $6, %eax          Return 4+2
6     ret


Alternatively, we can see that the same assembly code would be valid if the
two sums were computed in the assembly code in the opposite ordering as they are
in the C code. This would result in interchanging arguments a and b and arguments
u and v, yielding the following prototype:


int procprob(int b, short a, long *v, char *u);


Solution to Problem 3.34 (page 252) 
This example demonstrates the use of callee-saved registers as well as the stack 
for holding local data. 


A. We can see that lines 9-14 save local values a0-a5 into callee-saved registers
%rbx, %r15, %r14, %r13, %r12, and %rbp, respectively.


B. Local values a6 and a7 are stored on the stack at offsets 0 and 8 relative to
the stack pointer (lines 16 and 18).


C. After storing six local variables, the program has used up the supply of callee-
saved registers. It stores the remaining two local values on the stack.


Solution to Problem 3.35 (page 254) 

This problem provides a chance to examine the code for a recursive function. An
important lesson to learn is that recursive code has the exact same structure as the
other functions we have seen. The stack and register-saving disciplines suffice to
make recursive functions operate correctly.


A. Register %rbx holds the value of parameter x, so that it can be used to 
compute the result expression. 


B. The assembly code was generated from the following C code: 


long rfun(unsigned long x) {
    if (x == 0)
        return 0;
    unsigned long nx = x>>2;
    long rv = rfun(nx);
    return x + rv;
}

Solution to Problem 3.36 (page 256)

This exercise tests your understanding of data sizes and array indexing. Observe
that a pointer of any kind is 8 bytes long. Data type short requires 2 bytes, while
int requires 4.


Array   Element size   Total size   Start address   Element i

S       2              14           x_S             x_S + 2i
T       8              24           x_T             x_T + 8i
U       8              48           x_U             x_U + 8i
V       4              32           x_V             x_V + 4i
W       8              32           x_W             x_W + 8i


Solution to Problem 3.37 (page 258)
This problem is a variant of the one shown for integer array E. It is important to
understand the difference between a pointer and the object being pointed to. Since
data type short requires 2 bytes, all of the array indices are scaled by a factor of
2. Rather than using movl, as before, we now use movw.


Expression   Type      Value               Assembly

S+1          short *   x_S + 2             leaq 2(%rdx), %rax
S[3]         short     M[x_S + 6]          movw 6(%rdx), %ax
&S[i]        short *   x_S + 2i            leaq (%rdx,%rcx,2), %rax
S[4*i+1]     short     M[x_S + 8i + 2]     movw 2(%rdx,%rcx,8), %ax
S+i-5        short *   x_S + 2i - 10       leaq -10(%rdx,%rcx,2), %rax


Solution to Problem 3.38 (page 259) 

This problem requires you to work through the scaling operations to determine 
the address computations, and to apply Equation 3.1 for row-major indexing. The 
first step is to annotate the assembly code to determine how the address references
are computed:


long sum_element(long i, long j)
i in %rdi, j in %rsi
sum_element:
  leaq 0(,%rdi,8), %rdx        Compute 8i
  subq %rdi, %rdx              Compute 7i
  addq %rsi, %rdx              Compute 7i + j
  leaq (%rsi,%rsi,4), %rax     Compute 5j
  addq %rax, %rdi              Compute i + 5j
  movq Q(,%rdi,8), %rax        Retrieve M[x_Q + 8 (5j + i)]
  addq P(,%rdx,8), %rax        Add M[x_P + 8 (7i + j)]
  ret


We can see that the reference to matrix P is at byte offset 8 * (7i + j), while
the reference to matrix Q is at byte offset 8 * (5j + i). From this, we can determine
that P has 7 columns, while Q has 5, giving M = 5 and N = 7.
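
The row-major offsets derived for P and Q can be checked directly in C. The sketch below assumes the declarations long P[M][N] and long Q[N][M] with M = 5 and N = 7, and that the function adds P[i][j] and Q[j][i]; these declarations are our reconstruction from the offsets, and the offset helpers are hypothetical.

```c
#include <stddef.h>

#define M 5
#define N 7

long P[M][N];   /* 7 columns per row */
long Q[N][M];   /* 5 columns per row */

/* Byte offset of P[i][j] from the start of P: sizeof(long) * (N*i + j). */
size_t p_byte_offset(size_t i, size_t j)
{
    return (size_t)((char *)&P[i][j] - (char *)&P[0][0]);
}

/* Byte offset of Q[j][i] from the start of Q: sizeof(long) * (M*j + i). */
size_t q_byte_offset(size_t i, size_t j)
{
    return (size_t)((char *)&Q[j][i] - (char *)&Q[0][0]);
}

long sum_element(long i, long j)
{
    return P[i][j] + Q[j][i];
}
```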


Solution to Problem 3.39 (page 262)
These computations are direct applications of Equation 3.1:


* For L = 4, C = 16, and j = 0, pointer Aptr is computed as
  x_A + 4 * (16i + 0) = x_A + 64i.

* For L = 4, C = 16, i = 0, and j = k, Bptr is computed as
  x_B + 4 * (16 * 0 + k) = x_B + 4k.

* For L = 4, C = 16, i = 16, and j = k, Bend is computed as
  x_B + 4 * (16 * 16 + k) = x_B + 1,024 + 4k.










Solution to Problem 3.40 (page 262)
This exercise requires that you be able to study compiler-generated assembly code
to understand what optimizations have been performed. In this case, the compiler
was clever in its optimizations.
Let us first study the following C code, and then see how it is derived from the
assembly code generated for the original function.















/* Set all diagonal elements to val */
void fix_set_diag_opt(fix_matrix A, int val) {
    int *Abase = &A[0][0];
    long i = 0;
    long iend = N*(N+1);
    do {
        Abase[i] = val;
        i += (N+1);
    } while (i != iend);
}


This function introduces a variable, Abase, of type int *, pointing to the start
of array A. This pointer designates a sequence of 4-byte integers consisting of
elements of A in row-major order. We introduce an integer variable index that
steps through the diagonal elements of A, with the property that diagonal elements
i and i + 1 are spaced N + 1 elements apart in the sequence, and that once we reach
diagonal element N (index value N(N + 1)), we have gone beyond the end.

The actual assembly code follows this general form, but now the pointer
increments must be scaled by a factor of 4. We label register %rax as holding a value
index4 equal to index in our C version but scaled by a factor of 4. For N = 16, we
can see that our stopping point for index4 will be 4 * 16(16 + 1) = 1,088.




















void fix_set_diag(fix_matrix A, int val)
A in %rdi, val in %rsi
fix_set_diag:
  movl $0, %eax                Set index4 = 0
.L13:                        loop:
  movl %esi, (%rdi,%rax)       Set Abase[index4/4] to val
  addq $68, %rax               Increment index4 += 4(N+1)
  cmpq $1088, %rax             Compare index4: 4N(N+1)
  jne .L13                     If !=, goto loop
  rep; ret                     Return
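
To see the flat-index version in action, the following sketch makes the optimized C function runnable by assuming N = 16 and the typedef int fix_matrix[N][N]. The checker diag_only is a hypothetical helper. Indexing the whole matrix through Abase treats A as one row-major run of N*N ints, which is how the compiled code behaves.

```c
#define N 16
typedef int fix_matrix[N][N];

/* Set all diagonal elements to val by stepping a flat index N+1 at a time. */
void fix_set_diag_opt(fix_matrix A, int val)
{
    int *Abase = &A[0][0];
    long i = 0;
    long iend = N * (N + 1);
    do {
        Abase[i] = val;
        i += (N + 1);
    } while (i != iend);
}

/* Hypothetical check: returns 1 if exactly the diagonal holds val
 * after starting from an all-zero matrix (use a nonzero val). */
int diag_only(int val)
{
    static fix_matrix A;
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            A[r][c] = 0;
    fix_set_diag_opt(A, val);
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            if (A[r][c] != (r == c ? val : 0))
                return 0;
    return 1;
}
```

The last store uses index 15 * 17 = 255, the final element of the matrix, and the loop stops when the index reaches 16 * 17 = 272.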


Solution to Problem 3.41 (page 268) 


This problem gets you to think about structure layout and the code used to access 
structure fields. The structure declaration is a variant of the example shown in 
the text. It shows that nested structures are allocated by embedding the inner 
structures within the outer ones. 


A. The layout of the structure is as follows: 


Offset     0     8     12    16    24
Contents   p     s.x   s.y   next


B. It uses 24 bytes. 
C. As always, we start by annotating the assembly code: 


void sp_init(struct prob *sp)
sp in %rdi
sp_init:
  movl 12(%rdi), %eax      Get sp->s.y
  movl %eax, 8(%rdi)       Save in sp->s.x
  leaq 8(%rdi), %rax       Compute &(sp->s.x)
  movq %rax, (%rdi)        Store in sp->p
  movq %rdi, 16(%rdi)      Store sp in sp->next
  ret


From this, we can generate C code as follows: 


void sp_init(struct prob *sp)
{
    sp->s.x = sp->s.y;
    sp->p = &(sp->s.x);
    sp->next = sp;
}


Solution to Problem 3.42 (page 269)

This problem demonstrates how a very common data structure and operation on
it is implemented in machine code. We solve the problem by first annotating the
assembly code, recognizing that the two fields of the structure are at offsets 0
(for v) and 8 (for p).


long fun(struct ELE *ptr)
ptr in %rdi
fun:
  movl $0, %eax           result = 0
  jmp .L2                 Goto middle
.L3:                    loop:
  addq (%rdi), %rax       result += ptr->v
  movq 8(%rdi), %rdi      ptr = ptr->p
.L2:                    middle:
  testq %rdi, %rdi        Test ptr
  jne .L3                 If != NULL, goto loop
  rep; ret


A. Based on the annotated code, we can generate a C version: 


long fun(struct ELE *ptr) {
    long val = 0;
    while (ptr) {
        val += ptr->v;
        ptr = ptr->p;
    }
    return val;
}


B. We can see that each structure is an element in a singly linked list, with field 
v being the value of the element and p being a pointer to the next element. 
Function fun computes the sum of the element values in the list.
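
A small test harness makes the list-summing behavior concrete. The structure declaration below is an assumption based on the field offsets (v at offset 0, p at offset 8), and sum_first_n is a hypothetical helper that builds a short list in an array.

```c
#include <stddef.h>

/* Assumed declaration: long value at offset 0, next pointer at offset 8. */
struct ELE {
    long v;
    struct ELE *p;
};

long fun(struct ELE *ptr)
{
    long val = 0;
    while (ptr) {
        val += ptr->v;
        ptr = ptr->p;
    }
    return val;
}

/* Hypothetical helper: builds the list 1 -> 2 -> ... -> n (n at most 8)
 * in a local array and returns the sum computed by fun. */
long sum_first_n(int n)
{
    struct ELE nodes[8];
    if (n <= 0)
        return fun(NULL);
    if (n > 8)
        n = 8;
    for (int k = 0; k < n; k++) {
        nodes[k].v = k + 1;
        nodes[k].p = (k + 1 < n) ? &nodes[k + 1] : NULL;
    }
    return fun(&nodes[0]);
}
```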


Solution to Problem 3.43 (page 272) 
Structures and unions involve a simple set of concepts, but it takes practice to be 
comfortable with the different referencing patterns and their implementations. 


EXPR                   TYPE      Code

up->t1.u               long      movq (%rdi), %rax
                                 movq %rax, (%rsi)

up->t1.v               short     movw 8(%rdi), %ax
                                 movw %ax, (%rsi)

&up->t1.w              char *    addq $10, %rdi
                                 movq %rdi, (%rsi)

up->t2.a               int *     movq %rdi, (%rsi)

up->t2.a[up->t1.u]     int       movq (%rdi), %rax
                                 movl (%rdi,%rax,4), %eax
                                 movl %eax, (%rsi)

*up->t2.p              char      movq 8(%rdi), %rax
                                 movb (%rax), %al
                                 movb %al, (%rsi)
Solution to Problem 3.44 (page 275)
Understanding structure layout and alignment is very important for understand-
ing how much storage different data structures require and for understanding the
code generated by the compiler for accessing structures. This problem lets you
work out the details of some example structures.


A. struct P1 { int i; char c; int j; char d; };

   i    c    j    d    Total   Alignment
   0    4    8    12   16      4


B. struct P2 { int i; char c; char d; long j; };

   i    c    d    j    Total   Alignment
   0    4    5    8    16      8


C. struct P3 { short w[3]; char c[3] };

   w    c    Total   Alignment
   0    6    10      2


D. struct P4 { short w[5]; char *c[3] };

   w    c    Total   Alignment
   0    16   40      8


E. struct P5 { struct P3 a[2]; struct P2 t };

   a    t    Total   Alignment
   0    24   40      8
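
These layouts can be confirmed with offsetof and sizeof. The sketch below assumes an x86-64 Linux compiler such as GCC, where long and pointers are 8 bytes; other ABIs may give different numbers. The helper layouts_match_x86_64 is a hypothetical name.

```c
#include <stddef.h>

struct P1 { int i; char c; int j; char d; };
struct P2 { int i; char c; char d; long j; };
struct P3 { short w[3]; char c[3]; };
struct P4 { short w[5]; char *c[3]; };
struct P5 { struct P3 a[2]; struct P2 t; };

/* Hypothetical helper: returns 1 if the compiler's layout agrees with
 * the offsets and total sizes listed in the solution tables. */
int layouts_match_x86_64(void)
{
    return offsetof(struct P1, c) == 4 && offsetof(struct P1, j) == 8 &&
           offsetof(struct P1, d) == 12 && sizeof(struct P1) == 16 &&
           offsetof(struct P2, d) == 5 && offsetof(struct P2, j) == 8 &&
           sizeof(struct P2) == 16 &&
           offsetof(struct P3, c) == 6 && sizeof(struct P3) == 10 &&
           offsetof(struct P4, c) == 16 && sizeof(struct P4) == 40 &&
           offsetof(struct P5, t) == 24 && sizeof(struct P5) == 40;
}
```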


Solution to Problem 3.45 (page 275) 
This is an exercise in understanding structure layout and alignment. 


A. Here are the object sizes and byte offsets: 


Field    a    b    c    d    e    f    g    h

Size     8    2    8    1    4    1    8    4
Offset   0    8    16   24   28   32   40   48






B. The structure is a total of 56 bytes long. The end of the structure must be 
padded by 4 bytes to satisfy the 8-byte alignment requirement. 


C. One strategy that works, when all data elements have a length equal to a 
power of 2, is to order the structure elements in descending order of size. 
This leads to a declaration 
struct {
    char    *a;
    double   c;
    long     g;
    float    e;
    int      h;
    short    b;
    char     d;
    char     f;
} rec;


with the following offsets:


Field    a    c    g    e    h    b    d    f

Size     8    8    8    4    4    2    1    1
Offset   0    8    16   24   28   32   34   35


The structure must be padded by 4 bytes to satisfy the 8-byte alignment
requirement, giving a total of 40 bytes.


Solution to Problem 3.46 (page 282) 

This problem covers a wide range of topics, such as stack frames, string represen- 
tations, ASCII code, and byte ordering. It demonstrates the dangers of out-of- 
bounds memory references and the basic ideas behind buffer overflow. 


A. Stack after line 3: 


00 00 00 00 00 40 OO 76] Return address 
01 23 45 67 89 AB CD EF] Saved %rbx 


tt = np 


B. Stack after line 5: 


00 00 00 00 00 40 00 34] Return address 
33 32 31 30 39 38 37 36! Saved 4rbx 


35 34 33 32 31 30 39 38 
37 36 35 34 33 32 31 30|-—— buf = %rsp 


C. The program is attempting to return to address 0x040034. The low-order 2
bytes were overwritten by the code for character '4' and the terminating null
character.

D. The saved value of register %rbx was set to 0x3332313039383736. This value
will be loaded into the register before get_line returns.

E. The call to malloc should have had strlen(buf)+1 as its argument, and the
code should also check that the returned value is not equal to NULL.
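The two fixes named in part E can be sketched as follows. The function name copy_line is hypothetical, and the sketch addresses only the allocation step; it does not make the unbounded read into buf safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap-allocated copy of the string in buf, or NULL if the
   allocation fails. The caller is responsible for freeing the result. */
char *copy_line(const char *buf) {
    assert(buf != NULL);
    size_t len = strlen(buf);
    char *result = malloc(len + 1);   /* +1 for the terminating null */
    if (result == NULL)
        return NULL;                  /* report failure, do not write */
    memcpy(result, buf, len + 1);     /* copies the null byte as well */
    return result;
}
```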


Solution to Problem 3.47 (page 286) 


A. This corresponds to a range of around 2^13 addresses.


B. A 128-byte nop sled would cover 2^7 addresses with each test, and so we would
only require around 2^13 / 2^7 = 2^6 = 64 attempts.


This example clearly shows that the degree of randomization in this version
of Linux would provide only minimal deterrence against an overflow attack.


Solution to Problem 3.48 (page 288) 
This problem gives you another chance to see how x86-64 code manages the stack,
and to also better understand how to defend against buffer overflow attacks.


A. For the unprotected code, we can see that lines 4 and 5 compute the positions
of v and buf to be at offsets 24 and 0 relative to %rsp. In the protected code,
the canary is stored at offset 40 (line 4), while v and buf are at offsets 8 and
16 (lines 7 and 8).


B. In the protected code, local variable v is positioned closer to the top of the
stack than buf, and so an overrun of buf will not corrupt the value of v.


Solution to Problem 3.49 (page 293) 
This code combines many of the tricks we have seen for performing bit-level 
arithmetic. It requires careful study to make any sense of it. 


A. The leaq instruction of line 5 computes the value 8n + 22, which is then
rounded down to the nearest multiple of 16 by the andq instruction of line 6.
The resulting value will be 8n + 8 when n is odd and 8n + 16 when n is even,
and this value is subtracted from s1 to give s2.


B. The three instructions in this sequence round s2 up to the nearest multiple
of 8. They make use of the combination of biasing and shifting that we saw
for dividing by a power of 2 in Section 2.3.7.


C. These two examples can be seen as the cases that minimize and maximize
the values of e1 and e2.


n    s1       s2       p        e1    e2
5    2,065    2,017    2,024    1     7
6    2,064    2,000    2,000    16    0


D. We can see that s2 is computed in a way that preserves whatever offset s1 has
with the nearest multiple of 16. We can also see that p will be aligned on a
multiple of 8, as is recommended for an array of 8-byte elements.
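The arithmetic in parts A through C can be replayed in C. This is a sketch rather than the book's code: the names vframe_layout and compute_layout are ours, and we assume that e2 is the gap between s2 and p and that e1 is the gap between the end of the 8n-byte array and s1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of the quantities tabulated in part C. */
struct vframe_layout { uint64_t s2, p, e1, e2; };

struct vframe_layout compute_layout(uint64_t s1, uint64_t n) {
    struct vframe_layout f;
    /* Part A: 8n + 22, rounded down to a multiple of 16. */
    uint64_t alloc = (8 * n + 22) & ~(uint64_t)15;
    assert(alloc <= s1);
    f.s2 = s1 - alloc;
    /* Part B: bias by 7, then shift right and left by 3 to round up
       to a multiple of 8. */
    f.p = ((f.s2 + 7) >> 3) << 3;
    f.e2 = f.p - f.s2;             /* assumed: gap below the array */
    f.e1 = s1 - (f.p + 8 * n);     /* assumed: gap above the array */
    return f;
}
```

With s1 = 2,065 and n = 5, or s1 = 2,064 and n = 6, the function reproduces the two rows of the table in part C.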


Solution to Problem 3.50 (page 300) 

This exercise requires that you step through the code, paying careful attention to 
which conversion and data movement instructions are used. We can see the values 
being retrieved and converted as follows:

* The value at dp is retrieved, converted to an int (line 4), and then stored at
ip. We can therefore infer that val1 is d.

* The value at ip is retrieved, converted to a float (line 6), and then stored at
fp. We can therefore infer that val2 is i.

* The value of l is converted to a double (line 8) and stored at dp. We can
therefore infer that val3 is l.

* The value at fp is retrieved on line 3. The two instructions at lines 10-11
convert this to double precision as the value returned in register %xmm0. We
can therefore infer that val4 is f.


Solution to Problem 3.51 (page 300) 
These cases can be handled by selecting the appropriate entries from the tables in
Figures 3.47 and 3.48, or using one of the code sequences for converting between
floating-point formats.


Tx        Ty        Instruction(s)

long      double    vcvtsi2sdq %rdi, %xmm0, %xmm0
double    int       vcvttsd2si %xmm0, %eax
double    float     vunpcklpd %xmm0, %xmm0, %xmm0
                    vcvtpd2ps %xmm0, %xmm0
long      float     vcvtsi2ssq %rdi, %xmm0, %xmm0
float     long      vcvttss2siq %xmm0, %rax





Solution to Problem 3.52 (page 301)
The basic rules for mapping arguments to registers are fairly simple (although they
become much more complex with more and other types of arguments [77]).


A. double g1(double a, long b, float c, int d);

Registers: a in %xmm0, b in %rdi, c in %xmm1, d in %esi


B. double g2(int a, double *b, float *c, long d);

Registers: a in %edi, b in %rsi, c in %rdx, d in %rcx


C. double g3(double *a, double b, int c, float d);

Registers: a in %rdi, b in %xmm0, c in %esi, d in %xmm1


D. double g4(float a, int *b, float c, double d);

Registers: a in %xmm0, b in %rdi, c in %xmm1, d in %xmm2

Solution to Problem 3.53 (page 303)

We can see from the assembly code that there are two integer arguments, passed
in registers %rdi and %rsi. Let us name these i1 and i2. Similarly, there are two
floating-point arguments, passed in registers %xmm0 and %xmm1, which we name f1
and f2.
We can then annotate the assembly code:

Refer to arguments as i1 (%rdi), i2 (%rsi),
f1 (%xmm0), and f2 (%xmm1)

double funct1(arg1_t p, arg2_t q, arg3_t r, arg4_t s)
funct1:
  vcvtsi2ssq %rsi, %xmm2, %xmm2     Get i2 and convert from long to float
  vaddss     %xmm0, %xmm2, %xmm0    Add f1 (type float)
  vcvtsi2ss  %edi, %xmm2, %xmm2     Get i1 and convert from int to float
  vdivss     %xmm0, %xmm2, %xmm0    Compute i1 / (i2 + f1)
  vunpcklps  %xmm0, %xmm0, %xmm0
  vcvtps2pd  %xmm0, %xmm0           Convert to double
  vsubsd     %xmm1, %xmm0, %xmm0    Compute i1 / (i2 + f1) - f2 (double)
  ret


From this we see that the code computes the value i1/(i2+f1)-f2. We can also
see that i1 has type int, i2 has type long, f1 has type float, and f2 has type
double. The only ambiguity in matching arguments to the named values stems
from the commutativity of addition, yielding two possible results:


double funct1a(int p, float q, long r, double s);
double funct1b(int p, long q, float r, double s);


Solution to Problem 3.54 (page 303) 
This problem can readily be solved by stepping through the assembly code and 
determining what is computed on each step, as shown with the annotations below: 


double funct2(double w, int x, float y, long z)
w in %xmm0, x in %edi, y in %xmm1, z in %rsi
funct2:
  vcvtsi2ss  %edi, %xmm2, %xmm2     Convert x to float
  vmulss     %xmm1, %xmm2, %xmm1    Multiply by y
  vunpcklps  %xmm1, %xmm1, %xmm1
  vcvtps2pd  %xmm1, %xmm2           Convert x*y to double
  vcvtsi2sdq %rsi, %xmm1, %xmm1     Convert z to double
  vdivsd     %xmm1, %xmm0, %xmm0    Compute w/z
  vsubsd     %xmm0, %xmm2, %xmm0    Subtract from x*y
  ret                               Return


We can conclude from this analysis that the function computes y * x - w/z.


Solution to Problem 3.55 (page 305) 
This problem involves the same reasoning as was required to see that numbers 
declared at label .LC2 encode 1.8, but with a simpler example. 

We see that the two values are 0 and 1077936128 (0x40400000). From the 
high-order bytes, we can extract an exponent field of 0x404 (1028), from which 
we subtract a bias of 1023 to get an exponent of 5. Concatenating the fraction bits 
of the two values, we get a fraction field of 0, but with the implied leading value 
giving value 1.0. The constant is therefore 1.0 x 2^5 = 32.0.
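The same decoding can be done in C by reassembling the two 32-bit words into a 64-bit pattern and reinterpreting it as a double. This is a minimal sketch assuming the usual IEEE 754 double format; the function names double_from_words and unbiased_exponent are ours.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Combine the low-order and high-order 32-bit words, as they would be
   listed at a .LC label, into the double they encode. */
double double_from_words(uint32_t low, uint32_t high) {
    uint64_t bits = ((uint64_t)high << 32) | low;
    double value;
    memcpy(&value, &bits, sizeof value);  /* reinterpret the bit pattern */
    return value;
}

/* Extract the 11-bit exponent field from the high word and remove the
   bias of 1023. */
int unbiased_exponent(uint32_t high) {
    return (int)((high >> 20) & 0x7FF) - 1023;
}
```

For the values in this problem, the high word 0x40400000 gives an exponent of 5 and the reassembled constant is 32.0.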
Solution to Problem 3.56 (page 305)


A. We see here that the 16 bytes starting at address .LC1 form a mask, where
the low-order 8 bytes contain all ones, except for the most significant bit,
which is the sign bit of a double-precision value. When we compute the AND
of this mask with %xmm0, it will clear the sign bit of x, yielding the absolute
value. In fact, we generated this code by defining EXPR(x) to be fabs(x),
where fabs is defined in <math.h>.


B. We see that the vxorpd instruction sets the entire register to zero, and so this
is a way to generate floating-point constant 0.0.


C. We see that the 16 bytes starting at address .LC2 form a mask with a single
1 bit, at the position of the sign bit for the low-order value in the XMM
register. When we compute the EXCLUSIVE-OR of this mask with %xmm0, we
change the sign of x, computing the expression -x.
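The three mask tricks can be imitated on the 64-bit pattern of a double in plain C. This is an illustrative sketch, not the compiler's output: the helper names are ours, and it operates on a single 64-bit value rather than a 16-byte XMM register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

#define SIGN_BIT (UINT64_C(1) << 63)

/* Reinterpret a double as its bit pattern and back. */
static uint64_t to_bits(double x) { uint64_t b; memcpy(&b, &x, sizeof b); return b; }
static double from_bits(uint64_t b) { double x; memcpy(&x, &b, sizeof x); return x; }

/* Part A: AND with a mask of all ones except the sign bit. */
double abs_by_mask(double x) { return from_bits(to_bits(x) & ~SIGN_BIT); }

/* Part B: XOR of a value with itself clears every bit, giving 0.0. */
double zero_by_xor(double x) { uint64_t b = to_bits(x); return from_bits(b ^ b); }

/* Part C: XOR with a mask holding only the sign bit flips the sign. */
double negate_by_mask(double x) { return from_bits(to_bits(x) ^ SIGN_BIT); }
```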


Solution to Problem 3.57 (page 308) 
Again, we annotate the code, including dealing with the conditional branch:


double funct3(int *ap, double b, long c, float *dp)
ap in %rdi, b in %xmm0, c in %rsi, dp in %rdx
funct3:
  vmovss     (%rdx), %xmm1          Get d = *dp
  vcvtsi2sd  (%rdi), %xmm2, %xmm2   Get a = *ap and convert to double
  vucomisd   %xmm2, %xmm0           Compare b:a
  jbe        .L8                    If <=, goto lesseq
  vcvtsi2ssq %rsi, %xmm0, %xmm0     Convert c to float
  vmulss     %xmm1, %xmm0, %xmm1    Multiply by d
  vunpcklps  %xmm1, %xmm1, %xmm1
  vcvtps2pd  %xmm1, %xmm0           Convert to double
  ret                               Return
.L8:                              lesseq:
  vaddss     %xmm1, %xmm1, %xmm1    Compute d+d = 2.0 * d
  vcvtsi2ssq %rsi, %xmm0, %xmm0     Convert c to float
  vaddss     %xmm1, %xmm0, %xmm0    Compute c + 2*d
  vunpcklps  %xmm0, %xmm0, %xmm0
  vcvtps2pd  %xmm0, %xmm0           Convert to double
  ret                               Return


From this, we can write the following code for funct3:


double funct3(int *ap, double b, long c, float *dp) {
    int a = *ap;
    float d = *dp;
    if (a < b)
        return c*d;
    else
        return c+2*d;
}


Modern microprocessors are among the most complex systems ever created
by humans. A single silicon chip, roughly the size of a fingernail, can contain
several high-performance processors, large cache memories, and the logic
required to interface them to external devices. In terms of performance, the pro- 
cessors implemented on a single chip today dwarf the room-size supercomputers 
that cost over $10 million just 20 years ago. Even the embedded processors found 
in everyday appliances such as cell phones, navigation systems, and programmable 
thermostats are far more powerful than the early developers of computers could 
ever have envisioned. 

So far, we have only viewed computer systems down to the level of machine- 
language programs. We have seen that a processor must execute a sequence of 
instructions, where each instruction performs some primitive operation, such as 
adding two numbers. An instruction is encoded in binary form as a sequence of 
1 or more bytes. The instructions supported by a particular processor and their
byte-level encodings are known as its instruction set architecture (ISA). Different 
“families” of processors, such as Intel IA32 and x86-64, IBM/Freescale Power, 
and the ARM processor family, have different ISAs. A program compiled for one 
type of machine will not run on another. On the other hand, there are many dif- 
ferent models of processors within a single family. Each manufacturer produces 
processors of ever-growing performance and complexity, but the different models 
remain compatible at the ISA level. Popular families, such as x86-64, have pro- 
cessors supplied by multiple manufacturers. Thus, the ISA provides a conceptual 
layer of abstraction between compiler writers, who need only know what instruc- 
tions are permitted and how they are encoded, and processor designers, who must 
build machines that execute those instructions. 

In this chapter, we take a brief look at the design of processor hardware. We 
study the way a hardware system can execute the instructions of a particular ISA. 
This view will give you a better understanding of how computers work and the 
technological challenges faced by computer manufacturers. One important con- 
cept is that the actual way a modern processor operates can be quite different 
from the model of computation implied by the ISA. The ISA model would seem 
to imply sequential instruction execution, where each instruction is fetched and 
executed to completion before the next one begins. By executing different parts 
of multiple instructions simultaneously, the processor can achieve higher perfor- 
mance than if it executed just one instruction at a time. Special mechanisms are 
used to make sure the processor computes the same results as it would with se- 
quential execution. This idea of using clever tricks to improve performance while 
maintaining the functionality of a simpler and more abstract model is well known 
in computer science. Examples include the use of caching in Web browsers and
information retrieval data structures such as balanced binary trees and hash tables.

Chances are you will never design your own processor. This is a task for
experts working at fewer than 100 companies worldwide. Why, then, should you
learn about processor design?


* It is intellectually interesting and important. There is an intrinsic value in
learning how things work. It is especially interesting to learn the inner
workings of a system that is such a part of the daily lives of computer
scientists and engineers and yet remains a mystery to many. Processor design
embodies many of the principles of good engineering practice. It requires
creating a simple and regular structure to perform a complex task.


* Understanding how the processor works aids in understanding how the overall
computer system works. In Chapter 6, we will look at the memory system and
the techniques used to create an image of a very large memory with a very
fast access time. Seeing the processor side of the processor-memory interface
will make this presentation more complete.


* Although few people design processors, many design hardware systems that
contain processors. This has become commonplace as processors are embedded
into real-world systems such as automobiles and appliances. Embedded-system
designers must understand how processors work, because these systems are
generally designed and programmed at a lower level of abstraction than is the
case for desktop and server-based systems.


* You just might work on a processor design. Although the number of companies
producing microprocessors is small, the design teams working on those
processors are already large and growing. There can be over 1,000 people
involved in the different aspects of a major processor design.


Aside The progress of computer technology

To get a sense of how much computer technology has improved over the past four
decades, consider the following two processors.

The first Cray 1 supercomputer was delivered to Los Alamos National Laboratory
in 1976. It was the fastest computer in the world, able to perform as many as
250 million arithmetic operations per second. It came with 8 megabytes of
random access memory, the maximum configuration allowed by the hardware. The
machine was also very large: it weighed 5,000 kg, consumed 115 kilowatts, and
cost $9 million. In total, around 80 of them were manufactured.

The Apple ARM A7 microprocessor chip, introduced in 2013 to power the iPhone
5S, contains two CPUs, each of which can perform several billion arithmetic
operations per second, and 1 gigabyte of random access memory. The entire
phone weighs just 112 grams, consumes around 1 watt, and costs less than $800.
Over 9 million units were sold in the first weekend of its introduction. In
addition to being a powerful computer, it can be used to take pictures, to
place phone calls, and to provide driving directions, features never considered
for the Cray 1.

These two systems, spaced just 37 years apart, demonstrate the tremendous
progress of semiconductor technology. Whereas the Cray 1's CPU was constructed
using around 100,000 semiconductor chips, each containing less than 20
transistors, the Apple A7 has over 1 billion transistors on its single chip.
The Cray 1's 8-megabyte memory required 8,192 chips, whereas the iPhone's
gigabyte memory is contained in a single chip.


In this chapter, we start by defining a simple instruction set that we use as a 
running example for our processor implementations. We call this the “Y86-64” 








354 Chapter 4 Processor Architecture 


instruction set, because it was inspired by the x86-64 instruction set. Compared 
with x86-64, the Y86-64 instruction set has fewer data types, instructions, and 
addressing modes. It also has a simple byte-level encoding, making the machine
code less compact than the comparable x86-64 code, but also making the CPU's
decoding logic much easier to design. Even though the Y86-64 instruction set is very simple,
it is sufficiently complete to allow us to write programs manipulating integer data. 
Designing a processor to implement Y86-64 requires us to deal with many of the 
challenges faced by processor designers. 

We then provide some background on digital hardware design. We describe 
the basic building blocks used in a processor and how they are connected together 
and operated. This presentation builds on our discussion of Boolean algebra and 
bit-level operations from Chapter 2. We also introduce a simple language, HCL 
(for “hardware control language”), to describe the control portions of hardware 
systems. We will later use this language to describe our processor designs. Even if 
you already have some background in logic design, read this section to understand 
our particular notation. 

As a first step in designing a processor, we present a functionally correct, 
but somewhat impractical, Y86-64 processor based on sequential operation. This 
processor executes a complete Y86-64 instruction on every clock cycle. The clock 
must run slowly enough to allow an entire series of actions to complete within one 
cycle. Such a processor could be implemented, but its performance would be well 
below what could be achieved for this much hardware. 

With the sequential design as a basis, we then apply a series of transforma- 
tions to create a pipelined processor. This processor breaks the execution of each 
instruction into five steps, each of which is handled by a separate section or stage of 
the hardware. Instructions progress through the stages of the pipeline, with one in- 
struction entering the pipeline on each clock cycle. As a result, the processor can 
be executing the different steps of up to five instructions simultaneously. Mak- 
ing this processor preserve the sequential behavior of the Y86-64 ISA requires 
handling a variety of hazard conditions, where the location or operands of one 
instruction depend on those of other instructions that are still in the pipeline. 

We have devised a variety of tools for studying and experimenting with our 
processor designs. These include an assembler for Y86-64, a simulator for running 
Y86-64 programs on your machine, and simulators for two sequential and one 
pipelined processor design. The control logic for these designs is described by 
files in HCL notation. By editing these files and recompiling the simulator, you can 
alter and extend the simulator’s behavior. A number of exercises are provided that 
involve implementing new instructions and modifying how the machine processes 
instructions. Testing code is provided to help you evaluate the correctness of your 
modifications. These exercises will greatly aid your understanding of the material 
and will give you an appreciation for the many different design alternatives faced 
by processor designers. 

Web Aside ARCH:VLOG on page 467 presents a representation of our pipelined
Y86-64 processor in the Verilog hardware description language. This involves
creating modules for the basic hardware building blocks and for the overall
processor structure. We automatically translate the HCL description of the control
logic into Verilog. By first debugging the HCL description with our simulators, we
eliminate many of the tricky bugs that would otherwise show up in the hardware
design. Given a Verilog description, there are commercial and open-source tools
to support simulation and logic synthesis, generating actual circuit designs for the
microprocessors. So, although much of the effort we expend here is to create
pictorial and textual descriptions of a system, much as one would when writing software,
the fact that these designs can be automatically synthesized demonstrates that we
are indeed creating a system that can be realized as hardware.


4.1 The Y86-64 Instruction Set Architecture 


Defining an instruction set architecture, such as Y86-64, includes defining the 
different components of its state, the set of instructions and their encodings, a 
set of programming conventions, and the handling of exceptional events. 


4.1.1 Programmer-Visible State 


As Figure 4.1 illustrates, each instruction in a Y86-64 program can read and modify
some part of the processor state. This is referred to as the programmer-visible
state, where the "programmer" in this case is either someone writing programs
in assembly code or a compiler generating machine-level code. We will see in our
processor implementations that we do not need to represent and organize this
state in exactly the manner implied by the ISA, as long as we can make sure that
machine-level programs appear to have access to the programmer-visible state.
The state for Y86-64 is similar to that for x86-64. There are 15 program registers:
%rax, %rcx, %rdx, %rbx, %rsp, %rbp, %rsi, %rdi, and %r8 through %r14. (We omit
the x86-64 register %r15 to simplify the instruction encoding.) Each of these stores
a 64-bit word. Register %rsp is used as a stack pointer by the push, pop, call, and
return instructions. Otherwise, the registers have no fixed meanings or values.
There are three single-bit condition codes, ZF, SF, and OF, storing information
about the effect of the most recent arithmetic or logical instruction. The program
counter (PC) holds the address of the instruction currently being executed.


Figure 4.1 Y86-64 programmer-visible state. As with x86-64, programs for Y86-64
access and modify the program registers, the condition codes, the program
counter (PC), and the memory. The status code indicates whether the program is
running normally or some special event has occurred.
[Diagram components: RF: Program registers; CC: Condition codes; Stat: Program
status; PC; DMEM: Memory]

The memory is conceptually a large array of bytes, holding both program 
and data. Y86-64 programs reference memory locations using virtual addresses. 
A combination of hardware and operating system software translates these into 
the actual, or physical, addresses indicating where the values are actually stored 
in memory. We will study virtual memory in more detail in Chapter 9. For now, 
we can think of the virtual memory system as providing Y86-64 programs with an 
image of a monolithic byte array. 

A final part of the program state is a status code Stat, indicating the overall 
state of program execution. It will indicate either normal operation or that some 
sort of exception has occurred, such as when an instruction attempts to read 
from an invalid memory address. The possible status codes and the handling of
exceptions are described in Section 4.1.4.


4.1.2 Y86-64 Instructions 


Figure 4.2 gives a concise description of the individual instructions in the Y86-64 
ISA. We use this instruction set as a target for our processor implementations. The 
set of Y86-64 instructions is largely a subset of the x86-64 instruction set. It includes 
only 8-byte integer operations, has fewer addressing modes, and includes a smaller 
set of operations. Since we only use 8-byte data, we can refer to these as “words” 
without any ambiguity. In this figure, we show the assembly-code representation 
of the instructions on the left and the byte encodings on the right. Figure 4.3 shows 
further details of some of the instructions. The assembly-code format is similar to 
the ATT format for x86-64. 
Here are some details about the Y86-64 instructions. 


* The x86-64 movq instruction is split into four different instructions: irmovq, 
rrmovq, mrmovq, and rmmovq, explicitly indicating the form of the source and 
destination. The source is either immediate (i), register (r), or memory (m).
It is designated by the first character in the instruction name. The destination 
is either register (r) or memory (m). It is designated by the second character 
in the instruction name. Explicitly identifying the four types of data transfer 
will prove helpful when we decide how to implement them. 

The memory references for the two memory movement instructions have 
a simple base and displacement format. We do not support the second index 
register or any scaling of a register's value in the address computation. 

As with x86-64, we do not allow direct transfers from one memory location
to another. In addition, we do not allow a transfer of immediate data to
memory.

* There are four integer operation instructions, shown in Figure 4.2 as OPq.
These are addq, subq, andq, and xorq. They operate only on register data,
whereas x86-64 also allows operations on memory data. These instructions
set the three condition codes ZF, SF, and OF (zero, sign, and overflow).











Instruction          Byte 0    Byte 1    Remaining bytes

halt                 0 0
nop                  1 0
rrmovq rA, rB        2 0       rA rB
irmovq V, rB         3 0       F rB      V (8 bytes)
rmmovq rA, D(rB)     4 0       rA rB     D (8 bytes)
mrmovq D(rB), rA     5 0       rA rB     D (8 bytes)
OPq rA, rB           6 fn      rA rB
jXX Dest             7 fn      Dest (8 bytes, starting at byte 1)
cmovXX rA, rB        2 fn      rA rB
call Dest            8 0       Dest (8 bytes, starting at byte 1)
ret                  9 0
pushq rA             A 0       rA F
popq rA              B 0       rA F

Figure 4.2 Y86-64 instruction set. Instruction encodings range between 1 and 10
bytes. An instruction consists of a 1-byte instruction specifier, possibly a 1-byte register
specifier, and possibly an 8-byte constant word. Field fn specifies a particular integer
operation (OPq), data movement condition (cmovXX), or branch condition (jXX). All
numeric values are shown in hexadecimal.


* The seven jump instructions (shown in Figure 4.2 as jXX) are jmp, jle, jl, je,
jne, jge, and jg. Branches are taken according to the type of branch and the
settings of the condition codes. The branch conditions are the same as with
x86-64 (Figure 3.15).


* There are six conditional move instructions (shown in Figure 4.2 as cmovXX):
cmovle, cmovl, cmove, cmovne, cmovge, and cmovg. These have the same
format as the register-register move instruction rrmovq, but the destination
register is updated only if the condition codes satisfy the required constraints.


* The call instruction pushes the return address on the stack and jumps to the
destination address. The ret instruction returns from such a call.


* The pushq and popq instructions implement push and pop, just as they do in
x86-64.


* The halt instruction stops instruction execution. x86-64 has a comparable
instruction, called hlt. x86-64 application programs are not permitted to use
this instruction, since it causes the entire system to suspend operation. For
Y86-64, executing the halt instruction causes the processor to stop, with the
status code set to HLT. (See Section 4.1.4.)


4.1.3 Instruction Encoding 


Figure 4.2 also shows the byte-level encoding of the instructions. Each instruction
requires between 1 and 10 bytes, depending on which fields are required. Every
instruction has an initial byte identifying the instruction type. This byte is split
into two 4-bit parts: the high-order, or code, part, and the low-order, or function,
part. As can be seen in Figure 4.2, code values range from 0 to 0xB. The function
values are significant only for the cases where a group of related instructions share
a common code. These are given in Figure 4.3, showing the specific encodings of
the integer operation, branch, and conditional move instructions. Observe that
rrmovq has the same instruction code as the conditional moves. It can be viewed
as an “unconditional move” just as the jmp instruction is an unconditional jump,
both having function code 0.

As shown in Figure 4.4, each of the 15 program registers has an associated
register identifier (ID) ranging from 0 to 0xE. The numbering of registers in Y86-
64 matches what is used in x86-64. The program registers are stored within the
CPU in a register file, a small random access memory where the register IDs serve
as addresses. ID value 0xF is used in the instruction encodings and within our
hardware designs when we need to indicate that no register should be accessed.

Some instructions are just 1 byte long, but those that require operands have
longer encodings. First, there can be an additional register specifier byte, specifying
either one or two registers. These register fields are called rA and rB in Figure
4.2. As the assembly-code versions of the instructions show, they can specify the
registers used for data sources and destinations, as well as the base register used in
an address computation, depending on the instruction type. Instructions that have
no register operands, such as branches and call, do not have a register specifier
byte. Those that require just one register operand (irmovq, pushq, and popq) have
the other register specifier set to value 0xF. This convention will prove useful in
our processor implementation.


Operations      Branches                   Moves

addq  6 0       jmp  7 0    jne  7 4       rrmovq  2 0    cmovne  2 4
subq  6 1       jle  7 1    jge  7 5       cmovle  2 1    cmovge  2 5
andq  6 2       jl   7 2    jg   7 6       cmovl   2 2    cmovg   2 6
xorq  6 3       je   7 3                   cmove   2 3

Figure 4.3 Function codes for Y86-64 instruction set. The code specifies a particular
integer operation, branch condition, or data transfer condition. These instructions are
shown as OPq, jXX, and cmovXX in Figure 4.2.


Number   Register name     Number   Register name
0        %rax              8        %r8
1        %rcx              9        %r9
2        %rdx              A        %r10
3        %rbx              B        %r11
4        %rsp              C        %r12
5        %rbp              D        %r13
6        %rsi              E        %r14
7        %rdi              F        No register

Figure 4.4 Y86-64 program register identifiers. Each of the 15 program registers
has an associated identifier (ID) ranging from 0 to 0xE. ID 0xF in a register field of an
instruction indicates the absence of a register operand.

Some instructions require an additional 8-byte constant word. This word can 
serve as the immediate data for irmovq, the displacement for rmmovq and mrmovq 
address specifiers, and the destination of branches and calls. Note that branch and 
call destinations are given as absolute addresses, rather than using the PC-relative 
addressing seen in x86-64. Processors use PC-relative addressing to give more 
compact encodings of branch instructions and to allow code to be shifted from 
one part of memory to another without the need to update all of the branch target 
addresses. Since we are more concerned with simplicity in our presentation, we 
use absolute addressing. As with x86-64, all integers have a little-endian encoding. 
When the instruction is written in disassembled form, these bytes appear in reverse 
order. 

As an example, let us generate the byte encoding of the instruction rmmovq
%rsp,0x123456789abcd(%rdx) in hexadecimal. From Figure 4.2, we can see that
rmmovq has initial byte 40. We can also see that source register %rsp should be
encoded in the rA field, and base register %rdx should be encoded in the rB field.
Using the register numbers in Figure 4.4, we get a register specifier byte of 42.
Finally, the displacement is encoded in the 8-byte constant word. We first pad
0x123456789abcd with leading zeros to fill out 8 bytes, giving a byte sequence of
00 01 23 45 67 89 ab cd. We write this in byte-reversed order as cd ab 89 67 45 23 01
00. Combining these, we get an instruction encoding of 4042cdab896745230100.
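The steps of this worked example can be sketched in C. The function name encode_rmmovq and the constants are ours; the sketch assumes the layout described above: an initial byte of 0x40, a register specifier byte with rA in the high 4 bits and rB in the low 4 bits, and an 8-byte little-endian displacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { Y86_RMMOVQ = 0x40, Y86_RMMOVQ_LEN = 10 };

/* Encode "rmmovq rA, D(rB)" into out[0..9].
   ra and rb are register IDs from Figure 4.4 (0 through 0xE). */
void encode_rmmovq(uint8_t ra, uint8_t rb, uint64_t disp,
                   uint8_t out[Y86_RMMOVQ_LEN]) {
    assert(ra <= 0xE && rb <= 0xE);
    out[0] = Y86_RMMOVQ;                    /* code 4, function 0 */
    out[1] = (uint8_t)((ra << 4) | rb);     /* register specifier byte */
    for (size_t i = 0; i < 8; i++)          /* low-order byte first */
        out[2 + i] = (uint8_t)(disp >> (8 * i));
}
```

Calling encode_rmmovq(0x4, 0x2, 0x123456789abcd, code) for %rsp and %rdx fills code with the ten bytes 40 42 cd ab 89 67 45 23 01 00, matching the encoding derived by hand.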

Aside Comparing x86-64 to Y86-64 instruction encodings
Compared with the instruction encodings used in x86-64, the encoding of Y86-64 is much simpler but
also less compact. The register fields occur only in fixed positions in all Y86-64 instructions, whereas
they are packed into various positions in the different x86-64 instructions. An x86-64 instruction can
encode constant values in 1, 2, 4, or 8 bytes, whereas Y86-64 always requires 8 bytes.

One important property of any instruction set is that the byte encodings must
have a unique interpretation. An arbitrary sequence of bytes either encodes a
unique instruction sequence or is not a legal byte sequence. This property holds for
Y86-64, because every instruction has a unique combination of code and function
in its initial byte, and given this byte, we can determine the length and meaning of
any additional bytes. This property ensures that a processor can execute an object-
code program without any ambiguity about the meaning of the code. Even if the
code is embedded within other bytes in the program, we can readily determine
the instruction sequence as long as we start from the first byte in the sequence.
On the other hand, if we do not know the starting position of a code sequence, we
cannot reliably determine how to split the sequence into individual instructions.
This causes problems for disassemblers and other tools that attempt to extract
machine-level programs directly from object-code byte sequences.
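
To make the role of the initial byte concrete, here is a hedged C sketch that derives an instruction's length from that byte alone and uses it to split a byte sequence. The function names are hypothetical, and the lengths are the ones implied by the Y86-64 encodings of Figure 4.2 (1, 2, 9, or 10 bytes). The sketch checks only the initial byte; it does not validate register specifier bytes.

```c
#include <stdint.h>
#include <stddef.h>

/* Length in bytes of the instruction whose initial byte is b, or 0 if invalid. */
size_t y86_instr_length(uint8_t b)
{
    unsigned icode = b >> 4, ifun = b & 0xF;
    switch (icode) {
    case 0x0: case 0x1: case 0x9:  return ifun == 0 ? 1 : 0;   /* halt, nop, ret */
    case 0x2:                      return ifun <= 6 ? 2 : 0;   /* rrmovq, cmovXX */
    case 0x3: case 0x4: case 0x5:  return ifun == 0 ? 10 : 0;  /* irmovq, rmmovq, mrmovq */
    case 0x6:                      return ifun <= 3 ? 2 : 0;   /* OPq */
    case 0x7:                      return ifun <= 6 ? 9 : 0;   /* jXX */
    case 0x8:                      return ifun == 0 ? 9 : 0;   /* call */
    case 0xA: case 0xB:            return ifun == 0 ? 2 : 0;   /* pushq, popq */
    default:                       return 0;
    }
}

/*
 * Count the instructions in code[0..n-1], assuming code[0] starts an instruction.
 * Returns -1 on an invalid initial byte or a truncated instruction.
 */
long y86_count_instructions(const uint8_t *code, size_t n)
{
    long count = 0;
    size_t pos = 0;
    while (pos < n) {
        size_t len = y86_instr_length(code[pos]);
        if (len == 0 || len > n - pos)
            return -1;
        pos += len;
        count++;
    }
    return count;
}
```

Applied to the bytes 40 42 cd ab 89 67 45 23 01 00 63 00, the walk finds two instructions. Starting one byte later, it immediately meets 42, which is not a valid initial byte, a small illustration of why the starting position matters.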





Practice Problem 4.1
Determine the byte encoding of the Y86-64 instruction sequence that follows. The
line .pos 0x100 indicates that the starting address of the object code should be
0x100.

    .pos 0x100 # Start code at address 0x100
        irmovq $15,%rbx
        rrmovq %rbx,%rcx
    loop:
        rmmovq %rcx,-3(%rbx)
        addq %rbx,%rcx
        jmp loop

Practice Problem 4.2
For each byte sequence listed, determine the Y86-64 instruction sequence it
encodes. If there is some invalid byte in the sequence, show the instruction sequence
up to that point and indicate where the invalid value occurs. For each sequence,
we show the starting address, then a colon, and then the byte sequence.

A. 0x100: 30f3fcffffffffffffff40630008000000000000
B. 0x200: a06f800c020000000000000030f30a00000000000000
C. 0x300: 5054070000000000000010f0b01f
D. 0x400: 611373000400000000000000
E. 0x500: 6362a0f0
Aside RISC and CISC instruction sets

x86-64 is sometimes labeled as a “complex instruction set computer” (CISC, pronounced “sisk”),
and is deemed to be the opposite of ISAs that are classified as “reduced instruction set computers”
(RISC, pronounced “risk”). Historically, CISC machines came first, having evolved from the earliest
computers. By the early 1980s, instruction sets for mainframe and minicomputers had grown quite large,
as machine designers incorporated new instructions to support high-level tasks, such as manipulating
circular buffers, performing decimal arithmetic, and evaluating polynomials. The first microprocessors
appeared in the early 1970s and had limited instruction sets, because the integrated-circuit technology
then posed severe constraints on what could be implemented on a single chip. Microprocessors evolved
quickly and, by the early 1980s, were following the same path of increasing instruction set complexity
that had been the case for mainframes and minicomputers. The x86 family took this path, evolving into
IA32, and more recently into x86-64. The x86 line continues to evolve as new classes of instructions are
added based on the needs of emerging applications.

The RISC design philosophy developed in the early 1980s as an alternative to these trends. A group 
of hardware and compiler experts at IBM, strongly influenced by the ideas of IBM researcher John
Cocke, recognized that they could generate efficient code for a much simpler form of instruction set. In
fact, many of the high-level instructions that were being added to instruction sets were very difficult to
generate with a compiler and were seldom used. A simpler instruction set could be implemented with 
much less hardware and could be organized in an efficient pipeline structure, similar to those described 
later in this chapter. IBM did not commercialize this idea until many years later, when it developed the 
Power and PowerPC ISAs. 

The RISC concept was further developed by Professors David Patterson, of the University of 
California at Berkeley, and John Hennessy, of Stanford University. Patterson gave the name RISC to 
this new class of machines, and CISC to the existing class, since there had previously been no need to 
have a special designation for a nearly universal form of instruction set. 

When comparing CISC with the original RISC instruction sets, we find the following general 
characteristics: 





CISC: A large number of instructions. The Intel document describing the complete set of
instructions [51] is over 1,200 pages long.
Early RISC: Many fewer instructions, typically less than 100.

CISC: Some instructions with long execution times. These include instructions that copy an entire
block from one part of memory to another and others that copy multiple registers to and from
memory.
Early RISC: No instruction with a long execution time. Some early RISC machines did not even
have an integer multiply instruction, requiring compilers to implement multiplication as a
sequence of additions.

CISC: Variable-size encodings. x86-64 instructions can range from 1 to 15 bytes.
Early RISC: Fixed-length encodings. Typically all instructions are encoded as 4 bytes.

CISC: Multiple formats for specifying operands. In x86-64, a memory operand specifier can have
many different combinations of displacement, base and index registers, and scale factors.
Early RISC: Simple addressing formats. Typically just base and displacement addressing.

CISC: Arithmetic and logical operations can be applied to both memory and register operands.
Early RISC: Arithmetic and logical operations only use register operands. Memory referencing is
only allowed by load instructions, reading from memory into a register, and store instructions,
writing from a register to memory. This convention is referred to as a load/store architecture.

CISC: Implementation artifacts hidden from machine-level programs. The ISA provides a clean
abstraction between programs and how they get executed.
Early RISC: Implementation artifacts exposed to machine-level programs. Some RISC machines
prohibit particular instruction sequences and have jumps that do not take effect until the
following instruction is executed. The compiler is given the task of optimizing performance
within these constraints.

CISC: Condition codes. Special flags are set as a side effect of instructions and then used for
conditional branch testing.
Early RISC: No condition codes. Instead, explicit test instructions store the test results in normal
registers for use in conditional evaluation.

CISC: Stack-intensive procedure linkage. The stack is used for procedure arguments and return
addresses.
Early RISC: Register-intensive procedure linkage. Registers are used for procedure arguments and
return addresses. Some procedures can thereby avoid any memory references. Typically, the
processor has many more (up to 32) registers.

The Y86-64 instruction set includes attributes of both CISC and RISC instruction sets. On the
CISC side, it has condition codes and variable-length instructions, and it uses the stack to store return
addresses. On the RISC side, it uses a load/store architecture and a regular instruction encoding, and it
passes procedure arguments through registers. It can be viewed as taking a CISC instruction set (x86)
and simplifying it by applying some of the principles of RISC.
Aside The RISC versus CISC controversy

Through the 1980s, battles raged in the computer architecture community regarding the merits of RISC
versus CISC instruction sets. Proponents of RISC claimed they could get more computing power for
a given amount of hardware through a combination of streamlined instruction set design, advanced
compiler technology, and pipelined processor implementation. CISC proponents countered that fewer
CISC instructions were required to perform a given task, and so their machines could achieve higher
overall performance.

Major companies introduced RISC processor lines, including Sun Microsystems (SPARC), IBM
and Motorola (PowerPC), and Digital Equipment Corporation (Alpha). A British company, Acorn
Computers Ltd., developed its own architecture, ARM (originally an acronym for “Acorn RISC
machine”), which has become widely used in embedded applications, such as cell phones.

In the early 1990s, the debate diminished as it became clear that neither RISC nor CISC in their
purest forms were better than designs that incorporated the best ideas of both. RISC machines evolved
and introduced more instructions, many of which take multiple cycles to execute. RISC machines
today have hundreds of instructions in their repertoire, hardly fitting the name “reduced instruction
set machine.” The idea of exposing implementation artifacts to machine-level programs proved to be
shortsighted. As new processor models were developed using more advanced hardware structures,
many of these artifacts became irrelevant, but they still remained part of the instruction set. Still, the
core of RISC design is an instruction set that is well suited to execution on a pipelined machine.

More recent CISC machines also take advantage of high-performance pipeline structures. As we
will discuss in Section 5.7, they fetch the CISC instructions and dynamically translate them into a
sequence of simpler, RISC-like operations. For example, an instruction that adds a register to memory
is translated into three operations: one to read the original memory value, one to perform the addition,
and a third to write the sum to memory. Since the dynamic translation can generally be performed well
in advance of the actual instruction execution, the processor can sustain a very high execution rate.

Marketing issues, apart from technological ones, have also played a major role in determining the
success of different instruction sets. By maintaining compatibility with its existing processors, Intel with
x86 made it easy to keep moving from one generation of processor to the next. As integrated-circuit
technology improved, Intel and other x86 processor manufacturers could overcome the inefficiencies
created by the original 8086 instruction set design, using RISC techniques to produce performance
comparable to the best RISC machines. As we saw in Section 3.1, the evolution of IA32 into x86-64
provided an opportunity to incorporate several features of RISC into the x86 family. In the areas of
desktop, laptop, and server-based computing, x86 has achieved near total domination.

RISC processors have done very well in the market for embedded processors, controlling such
systems as cellular telephones, automobile brakes, and Internet appliances. In these applications, saving
on cost and power is more important than maintaining backward compatibility. In terms of the number
of processors sold, this is a very large and growing market.


4.1.4 Y86-64 Exceptions 


The programmer-visible state for Y86-64 (Figure 4.1) includes a status code Stat
describing the overall state of the executing program. The possible values for this
code are shown in Figure 4.5. Code value 1, named AOK, indicates that the program
is executing normally, while the other codes indicate that some type of exception
has occurred. Code 2, named HLT, indicates that the processor has executed a halt
instruction. Code 3, named ADR, indicates that the processor attempted to read
from or write to an invalid memory address, either while fetching an instruction
or while reading or writing data. We limit the maximum address (the exact limit
varies by implementation), and any access to an address beyond this limit will
trigger an ADR exception. Code 4, named INS, indicates that an invalid instruction
code has been encountered.

    Value   Name   Meaning
    1       AOK    Normal operation
    2       HLT    halt instruction encountered
    3       ADR    Invalid address encountered
    4       INS    Invalid instruction encountered

Figure 4.5 Y86-64 status codes. In our design, the processor halts for any code other
than AOK.

For Y86-64, we will simply have the processor stop executing instructions 
when it encounters any of the exceptions listed. In a more complete design, the 
processor would typically invoke an exception handler, a procedure designated 
to handle the specific type of exception encountered. As described in Chapter 8, 
exception handlers can be configured to have different effects, such as aborting 
the program or invoking a user-defined signal handler. 
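
As an illustration, the status codes of Figure 4.5 and the stop-on-exception rule might be modeled in C as follows. This is only a sketch: the type and function names are hypothetical, and max_addr stands in for the implementation-dependent address limit mentioned above.

```c
/* Y86-64 status codes, with the values listed in Figure 4.5. */
typedef enum {
    STAT_AOK = 1,   /* Normal operation */
    STAT_HLT = 2,   /* halt instruction encountered */
    STAT_ADR = 3,   /* Invalid address encountered */
    STAT_INS = 4    /* Invalid instruction encountered */
} y86_stat_t;

/* In the simplified design, execution continues only while Stat is AOK. */
int y86_should_continue(y86_stat_t stat)
{
    return stat == STAT_AOK;
}

/* Classify an address against an assumed upper limit max_addr. */
y86_stat_t y86_check_address(unsigned long addr, unsigned long max_addr)
{
    return addr <= max_addr ? STAT_AOK : STAT_ADR;
}
```

A more complete design would replace the simple stop with a dispatch to an exception handler selected by the status code.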


4.1.5 Y86-64 Programs

Figure 4.6 shows x86-64 and Y86-64 assembly code for the following C function:

    long sum(long *start, long count)
    {
        long sum = 0;
        while (count) {
            sum += *start;
            start++;
            count--;
        }
        return sum;
    }


The x86-64 code was generated by the GCC compiler. The Y86-64 code is 
similar, but with the following differences: 


* The Y86-64 code loads constants into registers (lines 2-3), since it cannot use
  immediate data in arithmetic instructions.

* The Y86-64 code requires two instructions (lines 8-9) to read a value from
  memory and add it to a register, whereas the x86-64 code can do this with a
  single addq instruction (line 5).

* Our hand-coded Y86-64 implementation takes advantage of the property that
  the subq instruction (line 11) also sets the condition codes, and so the testq
  instruction of the GCC-generated code (line 9) is not required. For this to work,
  though, the Y86-64 code must set the condition codes prior to entering the
  loop with an andq instruction (line 5).

x86-64 code

        long sum(long *start, long count)
        start in %rdi, count in %rsi
    1   sum:
    2     movl $0, %eax            sum = 0
    3     jmp .L2                  Goto test
    4   .L3:                     loop:
    5     addq (%rdi), %rax        Add *start to sum
    6     addq $8, %rdi            start++
    7     subq $1, %rsi            count--
    8   .L2:                     test:
    9     testq %rsi, %rsi         Test count
    10    jne .L3                  If !=0, goto loop
    11    rep; ret                 Return

Y86-64 code

        long sum(long *start, long count)
        start in %rdi, count in %rsi
    1   sum:
    2     irmovq $8,%r8            Constant 8
    3     irmovq $1,%r9            Constant 1
    4     xorq %rax,%rax           sum = 0
    5     andq %rsi,%rsi           Set CC
    6     jmp test                 Goto test
    7   loop:
    8     mrmovq (%rdi),%r10       Get *start
    9     addq %r10,%rax           Add to sum
    10    addq %r8,%rdi            start++
    11    subq %r9,%rsi            count--. Set CC
    12  test:
    13    jne loop                 Stop when 0
    14    ret                      Return

Figure 4.6 Comparison of Y86-64 and x86-64 assembly programs. The sum function
computes the sum of an integer array. The Y86-64 code follows the same general pattern
as the x86-64 code.
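
The way the Y86-64 version relies on the condition codes can be modeled in C. In this sketch the variable zf is a stand-in for the zero flag and y86_sum_model is a hypothetical name; the comments map each statement to the Y86-64 instruction it imitates.

```c
/* C model of the Y86-64 sum loop of Figure 4.6; zf stands in for the zero flag. */
long y86_sum_model(const long *start, long count)
{
    long sum = 0;                 /* xorq %rax,%rax */
    int zf = (count == 0);        /* andq %rsi,%rsi sets CC from count */
    goto test;                    /* jmp test */
loop:
    sum += *start;                /* mrmovq (%rdi),%r10 ; addq %r10,%rax */
    start++;                      /* addq %r8,%rdi (adds 8 bytes) */
    count--;                      /* subq %r9,%rsi ... */
    zf = (count == 0);            /* ... which also sets CC */
test:
    if (!zf)                      /* jne loop */
        goto loop;
    return sum;                   /* ret */
}
```

Without the initial assignment to zf, the first test would see whatever flag value the preceding instruction left behind, which is why the andq instruction is needed before the jump to test.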


Figure 4.7 shows an example of a complete program file written in Y86-64
assembly code. The program contains both data and instructions. Directives
indicate where to place code or data and how to align it. The program specifies 
issues such as stack placement, data initialization, program initialization, and 
program termination. 

In this program, words beginning with ‘.’ are assembler directives telling the 
assembler to adjust the address at which it is generating code or to insert some 
words of data. The directive .pos 0 (line 2) indicates that the assembler should 
begin generating code starting at address 0. This is the starting address for all 
Y86-64 programs. The next instruction (line 3) initializes the stack pointer. We 
can see that the label stack is declared at the end of the program (line 40), to 
indicate address 0x200 using a .pos directive (line 39). Our stack will therefore 
start at this address and grow toward lower addresses. We must ensure that the 
stack does not grow so large that it overwrites the code or other program data. 

Lines 8 to 13 of the program declare an array of four words, having the values

    0x000d000d000d, 0x00c000c000c0,
    0x0b000b000b00, 0xa000a000a000

The label array denotes the start of this array, and is aligned on an 8-byte boundary
(using the .align directive). Lines 16 to 19 show a “main” procedure that calls
the function sum on the four-word array and then halts.
As this example shows, since our only tool for creating Y86-64 code is an
assembler, the programmer must perform tasks we ordinarily delegate to the
compiler, linker, and run-time system. Fortunately, we only do this for small
programs, for which simple mechanisms suffice.

    1   # Execution begins at address 0
    2       .pos 0
    3       irmovq stack, %rsp      # Set up stack pointer
    4       call main               # Execute main program
    5       halt                    # Terminate program
    6
    7   # Array of 4 elements
    8       .align 8
    9   array:
    10      .quad 0x000d000d000d
    11      .quad 0x00c000c000c0
    12      .quad 0x0b000b000b00
    13      .quad 0xa000a000a000
    14
    15  main:
    16      irmovq array,%rdi
    17      irmovq $4,%rsi
    18      call sum                # sum(array, 4)
    19      ret
    20
    21  # long sum(long *start, long count)
    22  # start in %rdi, count in %rsi
    23  sum:
    24      irmovq $8,%r8           # Constant 8
    25      irmovq $1,%r9           # Constant 1
    26      xorq %rax,%rax          # sum = 0
    27      andq %rsi,%rsi          # Set CC
    28      jmp test                # Goto test
    29  loop:
    30      mrmovq (%rdi),%r10      # Get *start
    31      addq %r10,%rax          # Add to sum
    32      addq %r8,%rdi           # start++
    33      subq %r9,%rsi           # count--. Set CC
    34  test:
    35      jne loop                # Stop when 0
    36      ret                     # Return
    37
    38  # Stack starts here and grows to lower addresses
    39      .pos 0x200
    40  stack:

Figure 4.7 Sample program written in Y86-64 assembly code. The sum function is
called to compute the sum of a four-element array.

Figure 4.8 shows the result of assembling the code shown in Figure 4.7 by an
assembler we call YAS. The assembler output is in ASCII format to make it more
readable. On lines of the assembly file that contain instructions or data, the object
code contains an address, followed by the values of between 1 and 10 bytes.

                                | # Execution begins at address 0
    0x000:                      |     .pos 0
    0x000: 30f40002000000000000 |     irmovq stack, %rsp    # Set up stack pointer
    0x00a: 803800000000000000   |     call main             # Execute main program
    0x013: 00                   |     halt                  # Terminate program
                                |
                                | # Array of 4 elements
    0x018:                      |     .align 8
    0x018:                      | array:
    0x018: 0d000d000d000000     |     .quad 0x000d000d000d
    0x020: c000c000c0000000     |     .quad 0x00c000c000c0
    0x028: 000b000b000b0000     |     .quad 0x0b000b000b00
    0x030: 00a000a000a00000     |     .quad 0xa000a000a000
                                |
    0x038:                      | main:
    0x038: 30f71800000000000000 |     irmovq array,%rdi
    0x042: 30f60400000000000000 |     irmovq $4,%rsi
    0x04c: 805600000000000000   |     call sum              # sum(array, 4)
    0x055: 90                   |     ret
                                |
                                | # long sum(long *start, long count)
                                | # start in %rdi, count in %rsi
    0x056:                      | sum:
    0x056: 30f80800000000000000 |     irmovq $8,%r8         # Constant 8
    0x060: 30f90100000000000000 |     irmovq $1,%r9         # Constant 1
    0x06a: 6300                 |     xorq %rax,%rax        # sum = 0
    0x06c: 6266                 |     andq %rsi,%rsi        # Set CC
    0x06e: 708700000000000000   |     jmp test              # Goto test
    0x077:                      | loop:
    0x077: 50a70000000000000000 |     mrmovq (%rdi),%r10    # Get *start
    0x081: 60a0                 |     addq %r10,%rax        # Add to sum
    0x083: 6087                 |     addq %r8,%rdi         # start++
    0x085: 6196                 |     subq %r9,%rsi         # count--. Set CC
    0x087:                      | test:
    0x087: 747700000000000000   |     jne loop              # Stop when 0
    0x090: 90                   |     ret                   # Return
                                |
                                | # Stack starts here and grows to lower addresses
    0x200:                      |     .pos 0x200
    0x200:                      | stack:

Figure 4.8 Output of YAS assembler. Each line includes a hexadecimal address and between 1 and 10 bytes
of object code.

We have implemented an instruction set simulator we call YIS, the purpose
of which is to model the execution of a Y86-64 machine-code program without
attempting to model the behavior of any specific processor implementation. This
form of simulation is useful for debugging programs before actual hardware is
available, and for checking the result of either simulating the hardware or running
the program on the hardware itself. Running on our sample object code, YIS
generates the following output:

    Stopped in 34 steps at PC = 0x13.  Status 'HLT', CC Z=1 S=0 O=0
    Changes to registers:
    %rax:   0x0000000000000000      0x0000abcdabcdabcd
    %rsp:   0x0000000000000000      0x0000000000000200
    %rdi:   0x0000000000000000      0x0000000000000038
    %r8:    0x0000000000000000      0x0000000000000008
    %r9:    0x0000000000000000      0x0000000000000001
    %r10:   0x0000000000000000      0x0000a000a000a000

    Changes to memory:
    0x01f0: 0x0000000000000000      0x0000000000000055
    0x01f8: 0x0000000000000000      0x0000000000000013

The first line of the simulation output summarizes the execution and the
resulting values of the PC and program status. In printing register and memory
values, it only prints out words that change during simulation, either in registers
or in memory. The original values (here they are all zero) are shown on the left,
and the final values are shown on the right. We can see in this output that register
%rax contains 0xabcdabcdabcd, the sum of the 4-element array passed to
procedure sum. In addition, we can see that the stack, which starts at address 0x200
and grows toward lower addresses, has been used, causing changes to words of
memory at addresses 0x1f0 to 0x1f8. The maximum address for executable code is
0x090, and so the pushing and popping of values on the stack did not corrupt the
executable code.
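
The value reported for %rax can be checked by hand, or with a few lines of C. The sketch below simply repeats the sum computation on the four .quad values of Figure 4.7; the names y86_array and sum_words are hypothetical.

```c
#include <stdint.h>

/* The four .quad values declared in Figure 4.7. */
static const uint64_t y86_array[4] = {
    0x000d000d000dULL, 0x00c000c000c0ULL,
    0x0b000b000b00ULL, 0xa000a000a000ULL
};

/* Same computation as the sum function, on unsigned 64-bit words. */
uint64_t sum_words(const uint64_t *start, long count)
{
    uint64_t sum = 0;
    while (count) {
        sum += *start;
        start++;
        count--;
    }
    return sum;
}
```

Because each hexadecimal digit position is nonzero in exactly one of the four words, no carries occur, and sum_words(y86_array, 4) yields 0x0000abcdabcdabcd, the value shown for %rax in the simulator output. The two memory changes are the return addresses pushed by the two call instructions: 0x13 at address 0x1f8 (the call to main) and 0x55 at address 0x1f0 (the call to sum).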


Practice Problem 4.3
A common pattern in machine-level programs is to add a constant value to a
register. With the Y86-64 instructions presented thus far, this requires first using an
irmovq instruction to set a register to the constant, and then an addq instruction to
add this value to the destination register. Suppose we want to add a new instruction
iaddq with the following format:

    Byte           0      1       2 3 4 5 6 7 8 9
    iaddq V, rB    C 0    F rB    V

This instruction adds the constant value V to register rB.

Rewrite the Y86-64 sum function of Figure 4.6 to make use of the iaddq
instruction. In the original version, we dedicated registers %r8 and %r9 to hold
constant values. Now, we can avoid using those registers altogether.
Practice Problem 4.4
Write Y86-64 code to implement a recursive sum function rsum, based on the
following C code:

    long rsum(long *start, long count)
    {
        if (count <= 0)
            return 0;
        return *start + rsum(start+1, count-1);
    }

Use the same argument passing and register saving conventions as x86-64 code
does. You might find it helpful to compile the C code on an x86-64 machine and
then translate the instructions to Y86-64.





Practice Problem 4.5
Modify the Y86-64 code for the sum function (Figure 4.6) to implement a function
absSum that computes the sum of absolute values of an array. Use a conditional
jump instruction within your inner loop.

Practice Problem 4.6
Modify the Y86-64 code for the sum function (Figure 4.6) to implement a function
absSum that computes the sum of absolute values of an array. Use a conditional
move instruction within your inner loop.


4.1.6 Some Y86-64 Instruction Details 


Most Y86-64 instructions transform the program state in a straightforward man- 
ner, and so defining the intended effect of each instruction is not difficult. Two 
unusual instruction combinations, however, require special attention. 

The pushq instruction both decrements the stack pointer by 8 and writes a 
register value to memory. It is therefore not totally clear what the processor should 
do when executing the instruction pushq %rsp, since the register being pushed is 
being changed by the same instruction. Two different conventions are possible: 
(1) push the original value of %rsp, or (2) push the decremented value of %rsp. 
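
The two conventions can be contrasted with a toy model in C. This is a sketch under stated assumptions: the machine state, the memory size, and all names here are hypothetical, and the model says nothing about which convention real hardware follows.

```c
#include <stdint.h>

enum push_convention { PUSH_ORIGINAL, PUSH_DECREMENTED };

/* Toy machine state: a stack pointer and a small memory of 8-byte words. */
struct toy_machine {
    uint64_t rsp;
    uint64_t mem[64];     /* mem[i] models the word at byte address 8*i */
};

/* Model pushq %rsp under the chosen convention. */
void push_rsp(struct toy_machine *m, enum push_convention conv)
{
    uint64_t old_rsp = m->rsp;
    uint64_t new_rsp = old_rsp - 8;
    m->mem[new_rsp / 8] = (conv == PUSH_ORIGINAL) ? old_rsp : new_rsp;
    m->rsp = new_rsp;
}

/* Model popq into an ordinary register: read the word, then add 8 to rsp. */
uint64_t pop_reg(struct toy_machine *m)
{
    uint64_t value = m->mem[m->rsp / 8];
    m->rsp += 8;
    return value;
}

/* Push the stack pointer, pop it back, and return (original rsp - popped value). */
uint64_t push_pop_difference(enum push_convention conv)
{
    struct toy_machine m = { .rsp = 256 };
    uint64_t before = m.rsp;
    push_rsp(&m, conv);
    return before - pop_reg(&m);
}
```

Under convention (1) the difference is 0, and under convention (2) it is 8, so a single push followed by a pop is enough to tell the two apart.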

For the Y86-64 processor, let us adopt the same convention as is used with 
x86-64, as determined in the following problem. 





Practice Problem 4.7
Let us determine the behavior of the instruction pushq %rsp for an x86-64 pro-
cessor. We could try reading the Intel documentation on this instruction, but a
simpler approach is to conduct an experiment on an actual machine. The C com-
piler would not normally generate this instruction, so we must use hand-generated
assembly code for this task. Here is a test function we have written (Web Aside
ASM:EASM on page 178 describes how to write programs that combine C code with
handwritten assembly code):

    .text
    .globl pushtest
    pushtest:
      movq %rsp, %rax       Copy stack pointer
      pushq %rsp            Push stack pointer
      popq %rdx             Pop it back
      subq %rdx, %rax       Return 0 or 8
      ret

In our experiments, we find that function pushtest always returns 0. What
does this imply about the behavior of the instruction pushq %rsp under x86-64?



A similar ambiguity occurs for the instruction popq %rsp. It could either set
%rsp to the value read from memory or to the incremented stack pointer. As with
Problem 4.7, let us run an experiment to determine how an x86-64 machine would
handle this instruction, and then design our Y86-64 machine to follow the same
convention.









Practice Problem 4.8
The following assembly-code function lets us determine the behavior of the in-
struction popq %rsp for x86-64:

    .text
    .globl poptest
    poptest:
      movq %rsp, %rdi       Save stack pointer
      pushq $0xabcd         Push test value
      popq %rsp             Pop to stack pointer
      movq %rsp, %rax       Set popped value as return value
      movq %rdi, %rsp       Restore stack pointer
      ret

We find this function always returns 0xabcd. What does this imply about the
behavior of popq %rsp? What other Y86-64 instruction would have the exact same
behavior?
Aside Getting the details right: Inconsistencies across x86 models

Practice Problems 4.7 and 4.8 are designed to help us devise a consistent set of conventions for instruc-
tions that push or pop the stack pointer. There seems to be little reason why one would want to perform
either of these operations, and so a natural question to ask is, “Why worry about such picky details?”

Several useful lessons can be learned about the importance of consistency from the following
excerpt from the Intel documentation of the push instruction [51]:

    For IA-32 processors from the Intel 286 on, the PUSH ESP instruction pushes the value of the ESP
    register as it existed before the instruction was executed. (This is also true for Intel 64 architecture,
    real-address and virtual-8086 modes of IA-32 architecture.) For the Intel® 8086 processor, the
    PUSH SP instruction pushes the new value of the SP register (that is the value after it has been
    decremented by 2). (PUSH ESP instruction. Intel Corporation. 50.)

Although the exact details of this note may be difficult to follow, we can see that it states that,
depending on what mode an x86 processor operates under, it will do different things when instructed to
push the stack pointer register. Some modes push the original value, while others push the decremented
value. (Interestingly, there is no corresponding ambiguity about popping to the stack pointer register.)
There are two drawbacks to this inconsistency:

* It decreases code portability. Programs may have different behavior depending on the processor
  mode. Although the particular instruction is not at all common, even the potential for incompati-
  bility can have serious consequences.

* It complicates the documentation. As we see here, a special note is required to try to clarify the
  differences. The documentation for x86 is already complex enough without special cases such as
  this one.

We conclude, therefore, that working out details in advance and striving for complete consistency can
save a lot of trouble in the long run.


4.2 Logic Design and the Hardware Control Language HCL 


In hardware design, electronic circuits are used to compute functions on bits and 
to store bits in different kinds of memory elements. Most contemporary circuit 
technology represents different bit values as high or low voltages on signal wires. 
In current technology, logic value 1 is represented by a high voltage of around 1.0 
volt, while logic value 0 is represented by a low voltage of around 0.0 volts. Three 
major components are required to implement a digital system: combinational logic 
to compute functions on the bits, memory elements to store bits, and clock signals 
to regulate the updating of the memory elements. 

In this section, we provide a brief description of these different components. 
We also introduce HCL (for “hardware control language"), the language that 
we use to describe the control logic of the different processor designs. We only
describe HCL informally here. A complete reference for HCL can be found in 
Web Aside ARCH:HCL on page 472. 
Aside Modern logic design

At one time, hardware designers created circuit designs by drawing schematic diagrams of logic circuits
(first with paper and pencil, and later with computer graphics terminals). Nowadays, most designs
are expressed in a hardware description language (HDL), a textual notation that looks similar to a
programming language but that is used to describe hardware structures rather than program behaviors.
The most commonly used languages are Verilog, having a syntax similar to C, and VHDL, having
a syntax similar to the Ada programming language. These languages were originally designed for
creating simulation models of digital circuits. In the mid-1980s, researchers developed logic synthesis
programs that could generate efficient circuit designs from HDL descriptions. There are now a number
of commercial synthesis programs, and this has become the dominant technique for generating digital
circuits. This shift from hand-designed circuits to synthesized ones can be likened to the shift from
writing programs in assembly code to writing them in a high-level language and having a compiler
generate the machine code.

Our HCL language expresses only the control portions of a hardware design, with only a limited set
of operations and with no modularity. As we will see, however, the control logic is the most difficult part
of designing a microprocessor. We have developed tools that can directly translate HCL into Verilog,
and by combining this code with Verilog code for the basic hardware units, we can generate HDL
descriptions from which actual working microprocessors can be synthesized. By carefully separating
out, designing, and testing the control logic, we can create a working microprocessor with reasonable
effort. Web Aside ARCH:VLOG on page 467 describes how we can generate Verilog versions of a Y86-64
processor.
Figure 4.9 Logic gate types. Each gate generates output equal to some Boolean
function of its inputs.





4.2.1 Logic Gates 


Logic gates are the basic computing elements for digital circuits. They generate an 
output equal to some Boolean function of the bit values at their inputs. Figure 4.9 
shows the standard symbols used for Boolean functions AND, OR, and NOT. HCL
expressions are shown below the gates for the operators in C (Section 2.1.8): &&
for AND, || for OR, and ! for NOT. We use these instead of the bit-level C operators
&, |, and ~, because logic gates operate on single-bit quantities, not entire words.
Although the figure illustrates only two-input versions of the AND and OR gates, it
is common to see these being used as n-way operations for n > 2. We still write
these in HCL using binary operators, though, so the operation of a three-input
AND gate with inputs a, b, and c is described with the HCL expression a && b && c.

Logic gates are always active. If some input to a gate changes, then within
some small amount of time, the output will change accordingly.

Figure 4.10 Combinational circuit to test for bit equality. The output will equal 1
when both inputs are 0 or both are 1.





4.2.2 Combinational Circuits and HCL Boolean Expressions 


By assembling a number of logic gates into a network, we can construct computa- 
tional blocks known as combinational circuits. Several restrictions are placed on 
how the networks are constructed: 





* Every logic gate input must be connected to exactly one of the following:
(1) one of the system inputs (known as a primary input), (2) the output
connection of some memory element, or (3) the output of some logic gate.

* The outputs of two or more logic gates cannot be connected together. Oth- 
erwise, the two could try to drive the wire toward different voltages, possibly 
causing an invalid voltage or a circuit malfunction. 

* The network must be acyclic. That is, there cannot be a path through a series 
of gates that forms a loop in the network. Such loops can cause ambiguity in 
the function computed by the network. 




Figure 4.10 shows an example of a simple combinational circuit that we will
find useful. It has two inputs, a and b. It generates a single output eq, such that
the output will equal 1 if either a and b are both 1 (detected by the upper AND
gate) or are both 0 (detected by the lower AND gate). We write the function of this
network in HCL as


bool eq = (a && b) || (!a && !b);


This code simply defines the bit-level (denoted by data type bool) signal eq as a
function of inputs a and b. As this example shows, HCL uses C-style syntax, with
‘=’ associating a signal name with an expression. Unlike C, however, we do not
view this as performing a computation and assigning the result to some memory
location. Instead, it is simply a way to give a name to an expression.
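
One way to check that the expression computes the intended function is to enumerate all four input combinations. The following Python sketch is a software model only, not HCL; the name bit_eq is our own, chosen for illustration.

```python
from itertools import product

def bit_eq(a, b):
    """Model the circuit of Figure 4.10: (a && b) || (!a && !b)."""
    upper_and = a and b              # true only when both inputs are 1
    lower_and = (not a) and (not b)  # true only when both inputs are 0
    return int(bool(upper_and or lower_and))

# Enumerate all four input combinations (the full truth table).
truth_table = {(a, b): bit_eq(a, b) for a, b in product((0, 1), repeat=2)}

for (a, b), eq in sorted(truth_table.items()):
    print(f"a={a} b={b} eq={eq}")
```

Because the circuit has only two single-bit inputs, exhaustive enumeration is a complete check of its function.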





Practice Problem 4.9

Write an HCL expression for a signal xor, equal to the EXCLUSIVE-OR of inputs a
and b. What is the relation between the signals xor and eq defined above?


Figure 4.11 Single-bit multiplexor circuit. The output will equal input a if the control
signal s is 1 and will equal input b when s is 0.

Figure 4.11 shows another example of a simple but useful combinational
circuit known as a multiplexor (commonly referred to as a “MUX”). A multiplexor
selects a value from among a set of different data signals, depending on the value
of a control input signal. In this single-bit multiplexor, the two data signals are the 
input bits a and b, while the control signal is the input bit s. The output will equal 
a when s is 1, and it will equal b when s is 0. In this circuit, we can see that the two
AND gates determine whether to pass their respective data inputs to the OR gate.
The upper AND gate passes signal b when s is 0 (since the other input to the gate
is !s), while the lower AND gate passes signal a when s is 1. Again, we can write an
HCL expression for the output signal, using the same operations as are present in 
the combinational circuit: 


bool out = (s && a) || (!s && b); 
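
As a sanity check, the multiplexor expression can be modeled in software and compared against the intended behavior for all eight input combinations. This is a Python sketch, not HCL, and the function name mux is an illustrative assumption.

```python
from itertools import product

def mux(s, a, b):
    """Model the circuit of Figure 4.11: (s && a) || (!s && b)."""
    lower_and = s and a          # passes a when s is 1
    upper_and = (not s) and b    # passes b when s is 0
    return int(bool(lower_and or upper_and))

# The gate-level expression should agree with "a if s else b" for every input.
agrees = all(mux(s, a, b) == (a if s else b)
             for s, a, b in product((0, 1), repeat=3))
print(agrees)
```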


Our HCL expressions demonstrate a clear parallel between combinational 
logic circuits and logical expressions in C. They both use Boolean operations to 
compute functions over their inputs. Several differences between these two ways 
of expressing computation are worth noting: 


* Since a combinational circuit consists of a series of logic gates, it has the 
property that the outputs continually respond to changes in the inputs. If 
some input to the circuit changes, then after some delay, the outputs will 
change accordingly. By contrast, a C expression is only evaluated when it is 
encountered during the execution of a program. 


* Logical expressions in C allow arguments to be arbitrary integers, interpreting
0 as FALSE and anything else as TRUE. In contrast, our logic gates only operate
over the bit values 0 and 1.


* Logical expressions in C have the property that they might only be partially
evaluated. If the outcome of an AND or OR operation can be determined by just
evaluating the first argument, then the second argument will not be evaluated.
For example, with the C expression


(a && !a) && func(b,c)


the function func will not be called, because the expression (a && !a) evaluates
to 0. In contrast, combinational logic does not have any partial evaluation
rules. The gates simply respond to changing inputs.
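
The partial-evaluation difference can be observed directly in software. The sketch below uses Python, whose `and` operator short-circuits in the same way as the C `&&` operator; the call log and the "gate-like" variant are our own illustrative constructions, not part of HCL.

```python
calls = []  # records every invocation of func

def func(b, c):
    calls.append((b, c))
    return 1

a = 1

# Short-circuit evaluation: the left operand is false, so func is never called.
short_circuit = (a and not a) and func(2, 3)
calls_after_short_circuit = len(calls)

# Gate-like evaluation: both operands are computed before they are combined.
left = bool(a and not a)
right = bool(func(2, 3))
gate_like = left and right
calls_after_gate_like = len(calls)

print(calls_after_short_circuit, calls_after_gate_like)
```

Both versions produce a false result; they differ only in whether the second operand was ever computed.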
Figure 4.12 Word-level equality test circuit. (a) Bit-level implementation; (b) word-level
abstraction. The output will equal 1 when each bit from word A equals its counterpart
from word B. Word-level equality is one of the operations in HCL.


4.2.3 Word-Level Combinational Circuits and HCL Integer Expressions


By assembling large networks of logic gates, we can construct combinational
circuits that compute much more complex functions. Typically, we design circuits 
that operate on data words. These are groups of bit-level signals that represent an 
integer or some control pattern. For example, our processor designs will contain 
numerous words, with word sizes ranging between 4 and 64 bits, representing 
integers, addresses, instruction codes, and register identifiers. 

Combinational circuits that perform word-level computations are constructed 
using logic gates to compute the individual bits of the output word, based on the 
individual bits of the input words. For example, Figure 4.12 shows a combinational
circuit that tests whether two 64-bit words A and B are equal. That is, the output
will equal 1 if and only if each bit of A equals the corresponding bit of B. This
circuit is implemented using 64 of the single-bit equality circuits shown in Figure
4.10. The outputs of these single-bit circuits are combined with an AND gate to
form the circuit output.
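
The construction just described can be mirrored in a few lines of Python. This is only a software model under our own naming (bit_eq, word_eq, WORD_SIZE); it shows how 64 single-bit results feed one final AND.

```python
WORD_SIZE = 64

def bit_eq(a, b):
    """Single-bit equality, as in Figure 4.10."""
    return int(bool((a and b) or ((not a) and (not b))))

def word_eq(A, B):
    """Word-level equality built from WORD_SIZE single-bit equality tests."""
    eq_bits = [bit_eq((A >> i) & 1, (B >> i) & 1) for i in range(WORD_SIZE)]
    return int(all(eq_bits))  # the final AND gate over all 64 results

print(word_eq(0x1234, 0x1234), word_eq(0x1234, 0x1235))
```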

In HCL, we will declare any word-level signal as an int, without specifying 
the word size. This is done for simplicity. In a full-featured hardware description 
language, every word can be declared to have a specific number of bits. HCL allows
words to be compared for equality, and so the functionality of the circuit shown
in Figure 4.12 can be expressed at the word level as 


bool Eq = (A == B); 


where arguments A and B are of type int. Note that we use the same syntax
conventions as in C, where ‘=’ denotes assignment and ‘==’ denotes the equality
operator.

As is shown on the right side of Figure 4.12, we will draw word-level circuits
using medium-thickness lines to represent the set of wires carrying the individual 
bits of the word, and we will show a single-bit signal as a dashed line. 
Practice Problem 4.10

Suppose you want to implement a word-level equality circuit using the EXCLUSIVE-OR
circuits from Problem 4.9 rather than from bit-level equality circuits. Design
such a circuit for a 64-bit word consisting of 64 bit-level EXCLUSIVE-OR circuits and
two additional logic gates.


Figure 4.13 shows the circuit for a word-level multiplexor. This circuit gener- 
ates a 64-bit word Out equal to one of the two input words, A or B, depending on 
the control input bit s. The circuit consists of 64 identical subcircuits, each hav- 
ing a structure similar to the bit-level multiplexor from Figure 4.11. Rather than 
replicating the bit-level multiplexor 64 times, the word-level version reduces the 
number of inverters by generating !s once and reusing it at each bit position.

Figure 4.13 Word-level multiplexor circuit. (a) Bit-level implementation; (b) word-level
abstraction. The output will equal input word A when the control signal s is 1, and it
will equal B otherwise. Multiplexors are described in HCL using case expressions.

We will use many forms of multiplexors in our processor designs. They allow
us to select a word from a number of sources depending on some control condi- 
tion. Multiplexing functions are described in HCL using case expressions. A case 
expression has the following general form: 


[
    select_1 : expr_1;
    select_2 : expr_2;
    ...
    select_k : expr_k;
]


The expression contains a series of cases, where each case i consists of a Boolean
expression select_i, indicating when this case should be selected, and an integer
expression expr_i, indicating the resulting value.

Unlike the switch statement of C, we do not require the different selection 
expressions to be mutually exclusive. Logically, the selection expressions are eval- 
uated in sequence, and the case for the first one yielding 1 is selected. For example, 
the word-level multiplexor of Figure 4.13 can be described in HCL as 


word Out = [
    s : A;
    1 : B;
];

In this code, the second selection expression is simply 1, indicating that this 
case should be selected if no prior one has been. This is the way to specify a default 
case in HCL. Nearly all case expressions end in this manner. 
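
The first-match rule of a case expression can be sketched in Python as a loop over (selector, value) pairs. The helper case_expr and the function word_mux are hypothetical names introduced here for illustration; note that Python computes every value up front, which is harmless for this model.

```python
def case_expr(cases):
    """Return the value paired with the first selector that is true."""
    for select, value in cases:
        if select:
            return value
    raise ValueError("no selection expression was true")

def word_mux(s, A, B):
    # Models the HCL case expression: [ s : A; 1 : B; ]
    return case_expr([(s, A), (1, B)])

print(word_mux(1, 0xAAAA, 0xBBBB) == 0xAAAA,
      word_mux(0, 0xAAAA, 0xBBBB) == 0xBBBB)
```

Ending the list with a selector of 1 plays the role of the default case, so the error branch is never reached.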

Allowing nonexclusive selection expressions makes the HCL code more readable.
An actual hardware multiplexor must have mutually exclusive signals controlling
which input word should be passed to the output, such as the signals s and
!s in Figure 4.13. To translate an HCL case expression into hardware, a logic
synthesis program would need to analyze the set of selection expressions and resolve
any possible conflicts by making sure that only the first matching case would be
selected.

The selection expressions can be arbitrary Boolean expressions, and there can 
be an arbitrary number of cases. This allows case expressions to describe blocks 
where there are many choices of input signals with complex selection criteria. For 
example, consider the diagram of a 4-way multiplexor shown in Figure 4.14. This 
circuit selects from among the four input words A, B, C, and D based on the control
signals s1 and s0, treating the controls as a 2-bit binary number. We can express
this in HCL using Boolean expressions to describe the different combinations of
control bit patterns: 


word Out4 = [
    !s1 && !s0 : A; # 00
    !s1        : B; # 01
    !s0        : C; # 10
    1          : D; # 11
];


Figure 4.14 Four-way multiplexor. The different combinations of control signals s1 and
s0 determine which data input is transmitted to the output.


The comments on the right (any text starting with # and running for the rest of 
the line is a comment) show which combination of s1 and s0 will cause the case to
be selected. Observe that the selection expressions can sometimes be simplified,
since only the first matching case is selected. For example, the second expression
can be written !s1, rather than the more complete !s1 && s0, since the only other
possibility having s1 equal to 0 was given as the first selection expression. Similarly, 
the third expression can be written as !s0, while the fourth can simply be written 
as 1. 
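
To see that the simplified selectors really are equivalent to fully spelled-out, mutually exclusive ones, we can model both forms and compare them over all four control settings. This Python sketch uses hypothetical helper names (first_match, out4_simplified, out4_exclusive) and string placeholders for the data words.

```python
from itertools import product

def first_match(cases):
    """Return the value paired with the first selector that is true."""
    for select, value in cases:
        if select:
            return value
    raise ValueError("no selection expression was true")

def out4_simplified(s1, s0, A, B, C, D):
    return first_match([
        ((not s1) and (not s0), A),  # 00
        (not s1, B),                 # 01
        (not s0, C),                 # 10
        (1, D),                      # 11
    ])

def out4_exclusive(s1, s0, A, B, C, D):
    return first_match([
        ((not s1) and (not s0), A),
        ((not s1) and s0, B),
        (s1 and (not s0), C),
        (s1 and s0, D),
    ])

controls = list(product((0, 1), repeat=2))
selected = {(s1, s0): out4_simplified(s1, s0, "A", "B", "C", "D")
            for s1, s0 in controls}
equivalent = all(
    out4_simplified(s1, s0, "A", "B", "C", "D")
    == out4_exclusive(s1, s0, "A", "B", "C", "D")
    for s1, s0 in controls)
print(selected, equivalent)
```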

As a final example, suppose we want to design a logic circuit that finds the 
minimum value among a set of words A, B, and C, diagrammed as follows: 





We can express this using an HCL case expression as 


word Min3 = [
    A <= B && A <= C : A;
    B <= A && B <= C : B;
    1                : C;
];








Practice Problem 4.11

The HCL code given for computing the minimum of three words contains four
comparison expressions of the form X <= Y. Rewrite the code to compute the
same result, but using only three comparisons.











Figure 4.15 Arithmetic/logic unit (ALU). Depending on the setting of the function
input, the circuit will perform one of four different arithmetic and logical operations.


Practice Problem 4.12

Write HCL code describing a circuit that for word inputs A, B, and C selects the
median of the three values. That is, the output equals the word lying between the
minimum and maximum of the three inputs.





Combinational logic circuits can be designed to perform many different types 
of operations on word-level data. The detailed design of these is beyond the 
scope of our presentation. One important combinational circuit, known as an 
arithmetic/logic unit (ALU), is diagrammed at an abstract level in Figure 4.15. 
In our version, the circuit has three inputs: two data inputs labeled A and B and 
a control input. Depending on the setting of the control input, the circuit will 
perform different arithmetic or logical operations on the data inputs. Observe
that the four operations diagrammed for this ALU correspond to the four different
integer operations supported by the Y86-64 instruction set, and the control values
match the function codes for these instructions (Figure 4.3). Note also the ordering
of operands for subtraction, where the A input is subtracted from the B input.
This ordering is chosen in anticipation of the ordering of arguments in the subq
instruction.
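
A behavioral model of such an ALU fits in a few lines of Python. The sketch assumes the control values 0 through 3 select addition, subtraction, AND, and EXCLUSIVE-OR in that order (our reading of the function codes in Figure 4.3), and it truncates results to 64 bits; the names alu and MASK64 are our own.

```python
MASK64 = (1 << 64) - 1  # keep results within a 64-bit word

def alu(fn, A, B):
    """Apply the operation selected by control input fn to inputs A and B."""
    if fn == 0:
        result = B + A      # addq
    elif fn == 1:
        result = B - A      # subq: A is subtracted from B
    elif fn == 2:
        result = B & A      # andq
    elif fn == 3:
        result = B ^ A      # xorq
    else:
        raise ValueError("unsupported ALU function")
    return result & MASK64

print(alu(1, 9, 21))
```

With A = 9 and B = 21, the subtraction case computes 21 - 9, matching the operand ordering described above.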


4.2.4 Set Membership 
In our processor designs, we will find many examples where we want to compare
one signal against a number of possible matching signals, such as to test whether
the code for some instruction being processed matches some category of instruction
codes. As a simple example, suppose we want to generate the signals s1 and
s0 for the 4-way multiplexor of Figure 4.14 by selecting the high- and low-order
bits from a 2-bit signal code, as follows:


[Diagram: the 2-bit signal code feeds a control block that generates s1 and s0 for the
four-way multiplexor producing Out4.]


In this circuit, the 2-bit signal code would then control the selection among the
four data words A, B, C, and D. We can express the generation of signals s1 and s0
using equality tests based on the possible values of code:


bool s1 = code == 2 || code == 3;
bool s0 = code == 1 || code == 3;


A more concise expression can be written that expresses the property that s1
is 1 when code is in the set {2, 3}, and s0 is 1 when code is in the set {1, 3}:


bool s1 = code in { 2, 3 };
bool s0 = code in { 1, 3 };


The general form of a set membership test is


iexpr in {iexpr_1, iexpr_2, ..., iexpr_k}


where the value being tested (iexpr) and the candidate matches (iexpr_1 through
iexpr_k) are all integer expressions.
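
Python happens to have an `in` operator with the same reading, so the set membership tests can be modeled almost verbatim. The function name control_bits is an illustrative assumption; the check confirms that s1 and s0 are simply the high- and low-order bits of code.

```python
def control_bits(code):
    """Derive (s1, s0) from a 2-bit code using set membership tests."""
    s1 = int(code in {2, 3})
    s0 = int(code in {1, 3})
    return s1, s0

results = {code: control_bits(code) for code in range(4)}

# s1 and s0 should equal the high- and low-order bits of code.
matches_bits = all(results[code] == ((code >> 1) & 1, code & 1)
                   for code in range(4))
print(results, matches_bits)
```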


4.2.5 Memory and Clocking 


Combinational circuits, by their very nature, do not store any information. Instead, 
they simply react to the signals at their inputs, generating outputs equal to some 
function of the inputs. To create sequential circuits—that is, systems that have state 
and perform computations on that state—we must introduce devices that store 
information represented as bits. Our storage devices are all controlled by a single 
clock, a periodic signal that determines when new values are to be loaded into the 
devices. We consider two classes of memory devices: 


Clocked registers (or simply registers) store individual bits or words. The clock 
signal controls the loading of the register with the value at its input. 


Random access memories (or simply memories) store multiple words, using 
an address to select which word should be read or written. Examples 
of random access memories include (1) the virtual memory system of 
a processor, where a combination of hardware and operating system 
software make it appear to a processor that it can access any word within 
a large address space; and (2) the register file, where register identifiers
serve as the addresses. In a Y86-64 processor, the register file holds the
15 program registers (%rax through %r14).


As we can see, the word "register" means two slightly different things when 
speaking of hardware versus machine-language programming. In hardware, a 
register is directly connected to the rest of the circuit by its input and output 
wires. In machine-level programming, the registers represent a small collection 
of addressable words in the CPU, where the addresses consist of register IDs. 
These words are generally stored in the register file, although we will see that the 
hardware can sometimes pass a word directly from one instruction to another to
avoid the delay of first writing and then reading the register file. When necessary
to avoid ambiguity, we will call the two classes of registers “hardware registers”
and “program registers,” respectively.

Figure 4.16 Register operation. The register outputs remain held at the current register
state until the clock signal rises. When the clock rises, the values at the register inputs are
captured to become the new register state.

Figure 4.16 gives a more detailed view of a hardware register and how it 
operates. For most of the time, the register remains in a fixed state (shown as 
x), generating an output equal to its current state. Signals propagate through the 
combinational logic preceding the register, creating a new value for the register 
input (shown as y), but the register output remains fixed as long as the clock is low.
As the clock rises, the input signals are loaded into the register as its next state
(y), and this becomes the new register output until the next rising clock edge. A
key point is that the registers serve as barriers between the combinational logic
in different parts of the circuit. Values only propagate from a register input to its
output once every clock cycle at the rising clock edge. Our Y86-64 processors will
use clocked registers to hold the program counter (PC), the condition codes (CC),
and the program status (Stat).
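
The barrier behavior of a clocked register can be modeled with a small Python class. This is a sketch under our own naming (ClockedRegister, rising_edge); it captures only the rule that the output changes at a rising clock edge and at no other time.

```python
class ClockedRegister:
    """Holds a state; the output changes only on a rising clock edge."""

    def __init__(self, initial):
        self.state = initial   # value visible at the output
        self.input = initial   # value currently presented at the input

    @property
    def output(self):
        return self.state

    def rising_edge(self):
        # Capture the input as the new state.
        self.state = self.input

reg = ClockedRegister(initial="x")
reg.input = "y"                 # combinational logic produces a new input
before_edge = reg.output        # output is still the old state
reg.rising_edge()
after_edge = reg.output         # output now reflects the captured input
print(before_edge, after_edge)
```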

The following diagram shows a typical register file: 


[Diagram: register file with read ports A and B and write port W.]


This register file has two read ports, named A and B, and one write port, named
W. Such a multiported random access memory allows multiple read and write
operations to take place simultaneously. In the register file diagrammed, the circuit
can read the values of two program registers and update the state of a third. Each
port has an address input, indicating which program register should be selected,
and a data output or input giving a value for that program register. The addresses
are register identifiers, using the encoding shown in Figure 4.4. The two read ports
have address inputs srcA and srcB (short for “source A” and “source B”) and data
outputs valA and valB (short for “value A” and “value B”). The write port has
address input dstW (short for “destination W”) and data input valW (short for
“value W”).

The register file is not a combinational circuit, since it has internal storage. In 
our implementation, however, data can be read from the register file as if it were 
a block of combinational logic having addresses as inputs and the data as outputs. 
When either srcA or srcB is set to some register ID, then, after some delay, the 
value stored in the corresponding program register will appear on either valA or 
valB. For example, setting srcA to 3 will cause the value of program register %rbx 
to be read, and this value will appear on output valA. 

The writing of words to the register file is controlled by the clock signal in 
a manner similar to the loading of values into a clocked register. Every time the 
clock rises, the value on input valW is written to the program register indicated by 
the register ID on input dstW. When dstW is set to the special ID value 0xF, no
program register is written. Since the register file can be both read and written,
a natural question to ask is, “What happens if the circuit attempts to read and
write the same register simultaneously?” The answer is straightforward: if the
same register ID is used for both a read port and the write port, then, as the clock
rises, there will be a transition on the read port's data output from the old value to
the new. When we incorporate the register file into our processor design, we will 
make sure that we take this property into consideration. 
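
The read and write behavior of the register file can be sketched as a Python class in which reads are computed on demand (like combinational logic) and writes happen only at a rising clock edge. The class and method names are our own, and returning 0 when a read port is given the ID 0xF is an assumption of this sketch, not something stated above.

```python
RNONE = 0xF  # special ID meaning "no register"

class RegisterFile:
    def __init__(self):
        self.regs = [0] * 15            # program registers with IDs 0 through 14
        self.srcA = self.srcB = RNONE   # read-port address inputs
        self.dstW = RNONE               # write-port address input
        self.valW = 0                   # write-port data input

    def _read(self, src):
        # Assumption: reading with ID 0xF yields 0.
        return 0 if src == RNONE else self.regs[src]

    @property
    def valA(self):
        return self._read(self.srcA)

    @property
    def valB(self):
        return self._read(self.srcB)

    def rising_edge(self):
        # Writes take effect only when the clock rises.
        if self.dstW != RNONE:
            self.regs[self.dstW] = self.valW

rf = RegisterFile()
rf.regs[3] = 21                 # register ID 3 holds 21
rf.srcA, rf.dstW, rf.valW = 3, 3, 12
old_valA = rf.valA              # read and write target the same register
rf.rising_edge()
new_valA = rf.valA              # the read port now shows the new value
print(old_valA, new_valA)
```

Reading and writing the same register ID shows the transition from the old value to the new one at the clock edge.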

Our processor has a random access memory for storing program data, as 
illustrated below: 


[Diagram: data memory with inputs address and data in, and output data out.]


This memory has a single address input, a data input for writing, and a data output 
for reading. Like the register file, reading from our memory operates in a manner 
similar to combinational logic: If we provide an address on the address input and 
set the write control signal to 0, then after some delay, the value stored at that 
address will appear on data out. The error signal will be set to 1 if the address 
is out of range, and to 0 otherwise. Writing to the memory is controlled by the 
clock: We set address to the desired address, data in to the desired value, and 
write to 1. When we then operate the clock, the specified location in the memory 
will be updated, as long as the address is valid. As with the read operation, the 
error signal will be set to 1 if the address is invalid. This signal is generated by 
combinational logic, since the required bounds checking is purely a function of 
the address input and does not involve saving any state. 
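
The following Python sketch models this memory interface. It is a simplification under stated assumptions: the memory is word-addressed with a fixed size, data out reads as 0 when the address is invalid, and the names DataMemory and rising_edge are our own.

```python
class DataMemory:
    def __init__(self, size):
        self.words = [0] * size
        self.address = 0    # address input
        self.data_in = 0    # data input for writing
        self.write = 0      # write control signal

    @property
    def error(self):
        # Purely a function of the address input: no state is involved.
        return int(not (0 <= self.address < len(self.words)))

    @property
    def data_out(self):
        # Reads behave like combinational logic.
        return 0 if self.error else self.words[self.address]

    def rising_edge(self):
        # Writes happen on the clock edge, and only for valid addresses.
        if self.write and not self.error:
            self.words[self.address] = self.data_in

mem = DataMemory(size=16)
mem.address, mem.data_in, mem.write = 4, 99, 1
before_clock = mem.data_out     # not yet written
mem.rising_edge()
mem.write = 0
after_clock = mem.data_out      # the written value is now readable
mem.address = 100
out_of_range = mem.error        # address beyond the memory size
print(before_clock, after_clock, out_of_range)
```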
Aside Real-life memory design

The memory system in a full-scale microprocessor is far more complex than the simple one we assume
in our design. It consists of several forms of hardware memories, including several random access
memories, plus nonvolatile memory or magnetic disk, as well as a variety of hardware and software
mechanisms for managing these devices. The design and characteristics of the memory system are
described in Chapter 6.

Nonetheless, our simple memory design can be used for smaller systems, and it provides us with
an abstraction of the interface between the processor and memory for more complex systems.


Our processor includes an additional read-only memory for reading instruc- 
tions. In most actual systems, these memories are merged into a single memory 
with two ports: one for reading instructions, and the other for reading or writ- 
ing data. 


4.3 Sequential Y86-64 Implementations 


Now we have the components required to implement a Y86-64 processor. As a first 
step, we describe a processor called SEQ (for "sequential" processor). On each 
clock cycle, SEQ performs all the steps required to process a complete instruction.
This would require a very long cycle time, however, and so the clock rate would be
unacceptably low. Our purpose in developing SEQ is to provide a first step toward 
our ultimate goal of implementing an efficient pipelined processor. 


4.3.1 Organizing Processing into Stages 


In general, processing an instruction involves a number of operations. We organize
them in a particular sequence of stages, attempting to make all instructions follow 
a uniform sequence, even though the instructions differ greatly in their actions. 
The detailed processing at each step depends on the particular instruction being 
executed. Creating this framework will allow us to design a processor that makes 
best use of the hardware. The following is an informal description of the stages 
and the operations performed within them: 


Fetch. The fetch stage reads the bytes of an instruction from memory, using 
the program counter (PC) as the memory address. From the instruction 
it extracts the two 4-bit portions of the instruction specifier byte, referred 
to as icode (the instruction code) and ifun (the instruction function). It
possibly fetches a register specifier byte, giving one or both of the register
operand specifiers rA and rB. It also possibly fetches an 8-byte constant
word valC. It computes valP to be the address of the instruction following
the current one in sequential order. That is, valP equals the value of the
PC plus the length of the fetched instruction.

Decode. The decode stage reads up to two operands from the register file, giving
values valA and/or valB. Typically, it reads the registers designated by 
instruction fields rA and rB, but for some instructions it reads register %rsp. 


Execute. In the execute stage, the arithmetic/logic unit (ALU) either performs 
the operation specified by the instruction (according to the value of ifun), 
computes the effective address of a memory reference, or increments or 
decrements the stack pointer. We refer to the resulting value as valE. The 
condition codes are possibly set. For a conditional move instruction, the 
stage will evaluate the condition codes and move condition (given by ifun) 
and enable the updating of the destination register only if the condition 
holds. Similarly, for a jump instruction, it determines whether or not the 
branch should be taken. 


Memory. The memory stage may write data to memory, or it may read data 
from memory. We refer to the value read as valM. 


Write back. The write-back stage writes up to two results to the register file. 


PC update. The PC is set to the address of the next instruction. 


The processor loops indefinitely, performing these stages. In our simplified im- 
plementation, the processor will stop when any exception occurs—that is, when it 
executes a halt or invalid instruction, or it attempts to read or write an invalid
address. In a more complete design, the processor would enter an exception-handling
mode and begin executing special code determined by the type of exception. 

As can be seen by the preceding description, there is a surprising amount of 
processing required to execute a single instruction. Not only must we perform 
the stated operation of the instruction, we must also compute addresses, update 
stack pointers, and determine the next instruction address. Fortunately, the overall 
flow can be similar for every instruction. Using a very simple and uniform struc- 
ture is important when designing hardware, since we want to minimize the total 
amount of hardware and we must ultimately map it onto the two-dimensional 
surface of an integrated-circuit chip. One way to minimize the complexity is to 
have the different instructions share as much of the hardware as possible. For 
example, each of our processor designs contains a single arithmetic/logic unit 
that is used in different ways depending on the type of instruction being executed.
The cost of duplicating blocks of logic in hardware is much higher than
the cost of having multiple copies of code in software. It is also more difficult to 
deal with many special cases and idiosyncrasies in a hardware system than with 
software. 

Our challenge is to arrange the computing required for each of the different 
instructions to fit within this general framework. We will use the code shown in 
Figure 4.17 to illustrate the processing of different Y86-64 instructions. Figures 
4.18 through 4.21 contain tables describing how the different Y86-64 instructions 
proceed through the stages. It is worth the effort to study these tables carefully. 
They are in a form that enables a straightforward mapping into the hardware. 
Each line in these tables describes an assignment to some signal or stored state
(indicated by the assignment operation ‘←’). These should be read as if they were
evaluated in sequence from top to bottom. When we later map the computations
to hardware, we will find that we do not need to perform these evaluations in strict
sequential order.

 1   0x000: 30f20900000000000000 |   irmovq $9, %rdx
 2   0x00a: 30f31500000000000000 |   irmovq $21, %rbx
 3   0x014: 6123                 |   subq %rdx, %rbx          # subtract
 4   0x016: 30f48000000000000000 |   irmovq $128,%rsp         # Problem 4.13
 5   0x020: 40436400000000000000 |   rmmovq %rsp, 100(%rbx)   # store
 6   0x02a: a02f                 |   pushq %rdx               # push
 7   0x02c: b00f                 |   popq %rax                # Problem 4.14
 8   0x02e: 734000000000000000   |   je done                  # Not taken
 9   0x037: 804100000000000000   |   call proc                # Problem 4.18
10   0x040:                      | done:
11   0x040: 00                   |   halt
12   0x041:                      | proc:
13   0x041: 90                   |   ret                      # Return

Figure 4.17 Sample Y86-64 instruction sequence. We will trace the processing of these instructions through
the different stages.

Figure 4.18 shows the processing required for instruction types OPq (integer 
and logical operations), rrmovq (register-register move), and irmovq
(immediate-register move). Let us first consider the integer operations. Examining Figure 4.2,
we can see that we have carefully chosen an encoding of instructions so that the 
four integer operations (addq, subq, andq, and xorq) all have the same value of 
icode. We can handle them all by an identical sequence of steps, except that the 
ALU computation must be set according to the particular instruction operation, 
encoded in ifun. 

The processing of an integer-operation instruction follows the general pattern 
listed above. In the fetch stage, we do not require a constant word, and so valP 
is computed as PC + 2. During the decode stage, we read both operands. These 
are supplied to the ALU in the execute stage, along with the function specifier 
ifun, so that valE becomes the instruction result. This computation is shown as the 
expression valB OP valA, where OP indicates the operation specified by ifun. Note
the ordering of the two arguments: this order is consistent with the conventions
of Y86-64 (and x86-64). For example, the instruction subq %rax,%rdx is supposed
to compute the value R[%rdx] - R[%rax]. Nothing happens in the memory stage
for these instructions, but valE is written to register rB in the write-back stage, and 
the PC is set to valP to complete the instruction execution. 
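
The stage-by-stage pattern for an integer operation can be sketched as a short Python function. This is a software illustration only, not the hardware design: the function and table names are hypothetical, the function codes for addq, andq, and xorq (0, 2, 3) are assumed from Figure 4.3, memory and registers are modeled as dictionaries, and the overflow flag is left out for brevity. The inputs are the two bytes of the subq instruction at address 0x014 in Figure 4.17.

```python
MASK64 = (1 << 64) - 1

ALU_OPS = {
    0: lambda valA, valB: valB + valA,   # addq (assumed code)
    1: lambda valA, valB: valB - valA,   # subq
    2: lambda valA, valB: valB & valA,   # andq (assumed code)
    3: lambda valA, valB: valB ^ valA,   # xorq (assumed code)
}

def run_opq(mem, regs, pc):
    """Process one OPq instruction; return (new PC, condition codes)."""
    # Fetch
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    valP = pc + 2
    assert icode == 6, "this sketch handles only OPq instructions"
    # Decode
    valA, valB = regs[rA], regs[rB]
    # Execute
    valE = ALU_OPS[ifun](valA, valB) & MASK64
    cc = {"ZF": int(valE == 0), "SF": valE >> 63}  # OF omitted for brevity
    # Memory: nothing to do for OPq
    # Write back
    regs[rB] = valE
    # PC update
    return valP, cc

mem = {0x014: 0x61, 0x015: 0x23}   # bytes of subq %rdx, %rbx
regs = {2: 9, 3: 21}               # %rdx (ID 2) = 9, %rbx (ID 3) = 21
new_pc, cc = run_opq(mem, regs, 0x014)
print(hex(new_pc), regs[3], cc)
```

Each comment marks one stage, so the function body reads in the same order as the stage descriptions.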

Executing an rrmovq instruction proceeds much like an arithmetic operation. 
We do not need to fetch the second register operand, however. Instead, we set the 
second ALU input to zero and add this to the first, giving valE = valA, which is 1 


















Stage 
Fetch 


Decode 


Execute 


Memory 


Write back 


PC update 


OPq rA, rB 


icode:ifun < M,[PC] 
rA:rB < M,[PC + 1] 


valP < PC+2 


valA <- R[rA] 
valB < Rí[rB] 


valE < vaiB OP valA 
Set CC 


R[rB] < valE 


PC < valP 
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rrmovq rA, rB 


icode:ifun «— M,[PC} 
rA:rB «+ M,[PC +1] 


valP < PC 2 


valÀ < R[rA] 


valE <- 0-rvalA 


R[rB] < valE 


PC «- valP 


irmovq V, rB 
icode:ifun < M,[PC] 
rA:B «- Mj[PC + 1] 
valC < Mg[PC + 2] 
valP < .,PC 4- 10 


valE «+ 0+ valc 


R[rB] < valE 


PC « valP 


387 


Figure 4.18 Computations in sequential implementation of Y86-64 instructions OPq, rrmovq, and 
irmovq. These instructions compute a value and store the result in a register. The notation icode : ifun 
indicates the two components of the instruction byte, while rA : rB indicates the two components of the 


register specifier byte. The notation M,[x] indicates accessin 


location x, while Mg[x] indicates accessing 8 bytes. 


then written to the register file. Similar processing occurs for irmovq, except that 
we use constant value valC for the first ALU input. In addition, we must increment 
the program counter by 10 for irmovq due to the long instruction format. Neither 
of these instructions changes the condition codes. 





Aside  Tracing the execution of a subq instruction

As an example, let us follow the processing of the subq instruction on line 3 of the object code shown
in Figure 4.17. We can see that the previous two instructions initialize registers %rdx and %rbx to 9 and
21, respectively. We can also see that the instruction is located at address 0x014 and consists of 2 bytes,
having values 0x61 and 0x23. The stages would proceed as shown in the following table, which lists the
generic rule for processing an OPq instruction (Figure 4.18) on the left, and the computations for this
specific instruction on the right.

Stage        OPq rA, rB                subq %rdx, %rbx
Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[0x014] = 6:1
             rA:rB ← M1[PC + 1]        rA:rB ← M1[0x015] = 2:3
             valP ← PC + 2             valP ← 0x014 + 2 = 0x016
Decode       valA ← R[rA]              valA ← R[%rdx] = 9
             valB ← R[rB]              valB ← R[%rbx] = 21
Execute      valE ← valB OP valA       valE ← 21 - 9 = 12
             Set CC                    ZF ← 0, SF ← 0, OF ← 0
Memory
Write back   R[rB] ← valE              R[%rbx] ← valE = 12
PC update    PC ← valP                 PC ← valP = 0x016

As this trace shows, we achieve the desired effect of setting register %rbx to 12, setting all three
condition codes to zero, and incrementing the PC by 2.


Practice Problem 4.13
Fill in the right-hand column of the following table to describe the processing of
the irmovq instruction on line 4 of the object code in Figure 4.17:

             Generic                   Specific
Stage        irmovq V, rB              irmovq $128, %rsp
Fetch        icode:ifun ← M1[PC]
             rA:rB ← M1[PC + 1]
             valC ← M8[PC + 2]
             valP ← PC + 10
Decode
Execute      valE ← 0 + valC
Memory
Write back   R[rB] ← valE
PC update    PC ← valP

How does this instruction execution modify the registers and the PC?


Figure 4.19 shows the processing required for the memory write and read
instructions rmmovq and mrmovq. We see the same basic flow as before, but using the
ALU to add valC to valB, giving the effective address (the sum of the displacement
and the base register value) for the memory operation. In the memory stage, we
either write the register value valA to memory or read valM from memory.
Stage        rmmovq rA, D(rB)          mrmovq D(rB), rA
Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[PC]
             rA:rB ← M1[PC + 1]        rA:rB ← M1[PC + 1]
             valC ← M8[PC + 2]         valC ← M8[PC + 2]
             valP ← PC + 10            valP ← PC + 10
Decode       valA ← R[rA]
             valB ← R[rB]              valB ← R[rB]
Execute      valE ← valB + valC        valE ← valB + valC
Memory       M8[valE] ← valA           valM ← M8[valE]
Write back                             R[rA] ← valM
PC update    PC ← valP                 PC ← valP

Figure 4.19 Computations in sequential implementation of Y86-64 instructions
rmmovq and mrmovq. These instructions read or write memory.
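
As a rough sketch of the two columns of Figure 4.19, the following Python function computes the effective address in the execute stage and then either writes or reads memory. The function name mem_step and the dict-based register file and memory are assumptions made for illustration; in particular, memory is simplified to a dict mapping a byte address to the 8-byte word stored there, which ignores overlapping accesses and byte ordering.

```python
# Illustrative model of the Figure 4.19 computations (names are assumptions).
MASK = (1 << 64) - 1
IRMMOVQ, IMRMOVQ = 0x4, 0x5               # instruction codes (icode)

def mem_step(regs, mem, pc, icode, rA, rB, valC):
    """Process rmmovq or mrmovq; return the new (regs, mem, pc)."""
    regs, mem = dict(regs), dict(mem)
    valP = pc + 10                        # Fetch: 10-byte instruction
    valB = regs[rB]                       # Decode: base register
    valE = (valB + valC) & MASK           # Execute: effective address
    if icode == IRMMOVQ:
        valA = regs[rA]                   # Decode: value to store
        mem[valE] = valA                  # Memory: write valA
    else:
        valM = mem.get(valE, 0)           # Memory: read (unwritten words read as 0 here)
        regs[rA] = valM                   # Write back: valM goes to rA
    return regs, mem, valP                # PC update
```

Note that both instructions use the ALU in exactly the same way; they differ only in the direction of the memory access and in whether a register is written.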


Figure 4.20 shows the steps required to process pushq and popq instructions. 
These are among the most difficult Y86-64 instructions to implement, because 
they involve both accessing memory and incrementing or decrementing the stack 
pointer. Although the two instructions have similar flows, they have important 
differences.

The pushq instruction starts much like our previous instructions, but in the 
decode stage we use %rsp as the identifier for the second register operand, giving 
the stack pointer as value valB. In the execute stage, we use the ALU to decrement 
the stack pointer by 8. This decremented value is used for the memory write 
address and is also stored back to %rsp in the write-back stage. By using valE 
as the address for the write operation, we adhere to the Y86-64 (and x86-64) 
convention that pushq should decrement the stack pointer before writing, even 
though the actual updating of the stack pointer does not occur until after the 
memory operation has completed.

The popq instruction proceeds much like pushq, except that we read two 
copies of the stack pointer in the decode stage. This is clearly redundant, but we 
will see that having the stack pointer as both valA and valB makes the subsequent 
flow more similar to that of other instructions, enhancing the overall uniformity 
of the design. We use the ALU to increment the stack pointer by 8 in the execute 
stage, but use the unincremented value as the address for the memory operation. 
In the write-back stage, we update both the stack pointer register with the incre- 
mented stack pointer and register rA with the value read from memory. Using the 
unincremented stack pointer as the memory read address preserves the Y86-64 
(and x86-64) convention that popq should first read memory and then increment
the stack pointer.


Aside  Tracing the execution of an rmmovq instruction

Let us trace the processing of the rmmovq instruction on line 5 of the object code shown in Figure 4.17.
We can see that the previous instruction initialized register %rsp to 128, while %rbx still holds 12, as
computed by the subq instruction (line 3). We can also see that the instruction is located at address
0x020 and consists of 10 bytes. The first 2 bytes have values 0x40 and 0x43, while the final 8 bytes are
a byte-reversed version of the number 0x0000000000000064 (decimal 100). The stages would proceed
as follows:

             Generic                   Specific
Stage        rmmovq rA, D(rB)          rmmovq %rsp, 100(%rbx)
Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[0x020] = 4:0
             rA:rB ← M1[PC + 1]        rA:rB ← M1[0x021] = 4:3
             valC ← M8[PC + 2]         valC ← M8[0x022] = 100
             valP ← PC + 10            valP ← 0x020 + 10 = 0x02a
Decode       valA ← R[rA]              valA ← R[%rsp] = 128
             valB ← R[rB]              valB ← R[%rbx] = 12
Execute      valE ← valB + valC        valE ← 12 + 100 = 112
Memory       M8[valE] ← valA           M8[112] ← 128
Write back
PC update    PC ← valP                 PC ← 0x02a

As this trace shows, the instruction has the effect of writing 128 to memory address 112 and
incrementing the PC by 10.
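
The pushq and popq flows described above can be sketched as follows. This is a minimal model under stated assumptions: the function names pushq and popq, the dict-based register file (keyed by register ID, with %rsp as ID 4), and the dict-based memory (one 8-byte word per address key) are all illustrative choices, not part of the SEQ design itself.

```python
# Illustrative model of the pushq and popq stage computations (names are assumptions).
MASK = (1 << 64) - 1
RSP = 4                                   # register ID of %rsp

def pushq(regs, mem, pc, rA):
    """pushq rA: write R[rA] at the decremented stack pointer."""
    regs, mem = dict(regs), dict(mem)
    valP = pc + 2                         # Fetch
    valA, valB = regs[rA], regs[RSP]      # Decode
    valE = (valB - 8) & MASK              # Execute: decrement stack pointer
    mem[valE] = valA                      # Memory: the address is valE, not R[%rsp]
    regs[RSP] = valE                      # Write back
    return regs, mem, valP                # PC update

def popq(regs, mem, pc, rA):
    """popq rA: read from the old stack pointer, then increment it."""
    regs = dict(regs)
    valP = pc + 2                         # Fetch
    valA = valB = regs[RSP]               # Decode: two copies of the stack pointer
    valE = (valB + 8) & MASK              # Execute: increment stack pointer
    valM = mem.get(valA, 0)               # Memory: the address is the unincremented valA
    regs[RSP] = valE                      # Write back, in the order of the two writes
    regs[rA] = valM                       #   listed for popq
    return regs, mem, valP                # PC update
```

In both functions the memory address comes from a signal computed earlier in the same instruction (valE for pushq, valA for popq) rather than from a second read of %rsp.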


Stage        pushq rA                  popq rA
Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[PC]
             rA:rB ← M1[PC + 1]        rA:rB ← M1[PC + 1]
             valP ← PC + 2             valP ← PC + 2
Decode       valA ← R[rA]              valA ← R[%rsp]
             valB ← R[%rsp]            valB ← R[%rsp]
Execute      valE ← valB + (-8)        valE ← valB + 8
Memory       M8[valE] ← valA           valM ← M8[valA]
Write back   R[%rsp] ← valE            R[%rsp] ← valE
                                       R[rA] ← valM
PC update    PC ← valP                 PC ← valP

Figure 4.20 Computations in sequential implementation of Y86-64 instructions
pushq and popq. These instructions push and pop the stack.


Practice Problem 4.14
Fill in the right-hand column of the following table to describe the processing of
the popq instruction on line 7 of the object code in Figure 4.17.

             Generic                   Specific
Stage        popq rA                   popq %rax
Fetch        icode:ifun ← M1[PC]
             rA:rB ← M1[PC + 1]
             valP ← PC + 2
Decode       valA ← R[%rsp]
             valB ← R[%rsp]
Execute      valE ← valB + 8
Memory       valM ← M8[valA]
Write back   R[%rsp] ← valE
             R[rA] ← valM
PC update    PC ← valP

What effect does this instruction execution have on the registers and the PC?






Practice Problem 4.15
What would be the effect of the instruction pushq %rsp according to the steps
listed in Figure 4.20? Does this conform to the desired behavior for Y86-64, as
determined in Problem 4.7?
Aside  Tracing the execution of a pushq instruction

Let us trace the processing of the pushq instruction on line 6 of the object code shown in Figure 4.17.
At this point, we have 9 in register %rdx and 128 in register %rsp. We can also see that the instruction is
located at address 0x02a and consists of 2 bytes having values 0xa0 and 0x2f. The stages would proceed
as follows:

             Generic                   Specific
Stage        pushq rA                  pushq %rdx
Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[0x02a] = a:0
             rA:rB ← M1[PC + 1]        rA:rB ← M1[0x02b] = 2:f
             valP ← PC + 2             valP ← 0x02a + 2 = 0x02c
Decode       valA ← R[rA]              valA ← R[%rdx] = 9
             valB ← R[%rsp]            valB ← R[%rsp] = 128
Execute      valE ← valB + (-8)        valE ← 128 + (-8) = 120
Memory       M8[valE] ← valA           M8[120] ← 9
Write back   R[%rsp] ← valE            R[%rsp] ← 120
PC update    PC ← valP                 PC ← 0x02c

As this trace shows, the instruction has the effect of setting %rsp to 120, writing 9 to address 120,
and incrementing the PC by 2.


Practice Problem 4.16
Assume the two register writes in the write-back stage for popq occur in the order
listed in Figure 4.20. What would be the effect of executing popq %rsp? Does this
conform to the desired behavior for Y86-64, as determined in Problem 4.8?


Figure 4.21 indicates the processing of our three control transfer instructions:
the different jumps, call, and ret. We see that we can implement these instructions
with the same overall flow as the preceding ones.

As with integer operations, we can process all of the jumps in a uniform
manner, since they differ only when determining whether or not to take the
branch. A jump instruction proceeds through fetch and decode much like
the previous instructions, except that it does not require a register specifier byte.
In the execute stage, we check the condition codes and the jump condition to
determine whether or not to take the branch, yielding a 1-bit signal Cnd. During the
PC update stage, we test this flag and set the PC to valC (the jump target) if the
flag is 1 and to valP (the address of the following instruction) if the flag is 0. Our
notation x ? a : b is similar to the conditional expression in C: it yields a when x
is 1 and b when x is 0.
Stage        jXX Dest                  call Dest                 ret
Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[PC]       icode:ifun ← M1[PC]
             valC ← M8[PC + 1]         valC ← M8[PC + 1]
             valP ← PC + 9             valP ← PC + 9             valP ← PC + 1
Decode                                                           valA ← R[%rsp]
                                       valB ← R[%rsp]            valB ← R[%rsp]
Execute                                valE ← valB + (-8)        valE ← valB + 8
             Cnd ← Cond(CC, ifun)
Memory                                 M8[valE] ← valP           valM ← M8[valA]
Write back                             R[%rsp] ← valE            R[%rsp] ← valE
PC update    PC ← Cnd ? valC : valP    PC ← valC                 PC ← valM

Figure 4.21 Computations in sequential implementation of Y86-64 instructions jXX, call, and ret.
These instructions cause control transfers.
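
The jump column of Figure 4.21 reduces to computing Cnd and then selecting between valC and valP. A minimal Python sketch follows, assuming the condition codes are passed as a (ZF, SF, OF) tuple and using the standard Y86-64 function codes for the jump types (jmp 0, jle 1, jl 2, je 3, jne 4, jge 5, jg 6); the function names cond and jxx are hypothetical.

```python
# Illustrative model of the jXX computations (names are assumptions).
def cond(cc, ifun):
    """Return the 1-bit signal Cnd for condition codes cc = (ZF, SF, OF)."""
    zf, sf, of = cc
    lt = sf ^ of                          # signed "less than" after a comparison
    table = {
        0: 1,                             # jmp: unconditional
        1: lt | zf,                       # jle
        2: lt,                            # jl
        3: zf,                            # je
        4: zf ^ 1,                        # jne
        5: lt ^ 1,                        # jge
        6: (lt ^ 1) & (zf ^ 1),           # jg
    }
    return table[ifun]

def jxx(pc, cc, ifun, valC):
    """Return the new PC for a jump located at address pc with target valC."""
    valP = pc + 9                         # Fetch: 1 instruction byte + 8-byte target
    cnd = cond(cc, ifun)                  # Execute
    return valC if cnd else valP          # PC update: Cnd ? valC : valP
```

Because the jump types differ only in the Boolean function applied to the condition codes, a single code path handles all seven of them.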





Practice Problem 4.17
We can see by the instruction encodings (Figures 4.2 and 4.3) that the rrmovq
instruction is the unconditional version of a more general class of instructions
that include the conditional moves. Show how you would modify the steps for the
rrmovq instruction below to also handle the six conditional move instructions.
You may find it useful to see how the implementation of the jXX instructions
(Figure 4.21) handles conditional behavior.

Stage        cmovXX rA, rB
Fetch        icode:ifun ← M1[PC]
             rA:rB ← M1[PC + 1]
             valP ← PC + 2
Decode       valA ← R[rA]
Execute      valE ← 0 + valA
Memory
Write back   R[rB] ← valE
PC update    PC ← valP
Aside  Tracing the execution of a je instruction

Let us trace the processing of the je instruction on line 8 of the object code shown in Figure 4.17. The
condition codes were all set to zero by the subq instruction (line 3), and so the branch will not be taken.
The instruction is located at address 0x02e and consists of 9 bytes. The first has value 0x73, while the
remaining 8 bytes are a byte-reversed version of the number 0x0000000000000040, the jump target.
The stages would proceed as follows:

             Generic                   Specific
Stage        jXX Dest                  je 0x040
Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[0x02e] = 7:3
             valC ← M8[PC + 1]         valC ← M8[0x02f] = 0x040
             valP ← PC + 9             valP ← 0x02e + 9 = 0x037
Decode
Execute      Cnd ← Cond(CC, ifun)      Cnd ← Cond((0, 0, 0), 3) = 0
Memory
Write back
PC update    PC ← Cnd ? valC : valP    PC ← 0 ? 0x040 : 0x037 = 0x037

As this trace shows, the instruction has the effect of incrementing the PC by 9.


Instructions call and ret bear some similarity to instructions pushq and popq,
except that we push and pop program counter values. With instruction call, we
push valP, the address of the instruction that follows the call instruction. During
the PC update stage, we set the PC to valC, the call destination. With instruction
ret, we assign valM, the value popped from the stack, to the PC in the PC update
stage.
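
To make the parallel with pushq and popq concrete, here is a small Python sketch of the call and ret computations. As before, this is an assumption-laden illustration rather than the actual design: the function names call and ret are hypothetical, registers are a dict keyed by register ID (%rsp is ID 4), and memory is a dict holding one 8-byte word per address key.

```python
# Illustrative model of the call and ret computations (names are assumptions).
MASK = (1 << 64) - 1
RSP = 4                                   # register ID of %rsp

def call(regs, mem, pc, valC):
    """call Dest: push the return address valP, then jump to valC."""
    regs, mem = dict(regs), dict(mem)
    valP = pc + 9                         # Fetch: address of the next instruction
    valB = regs[RSP]                      # Decode
    valE = (valB - 8) & MASK              # Execute: decrement stack pointer
    mem[valE] = valP                      # Memory: push the return address
    regs[RSP] = valE                      # Write back
    return regs, mem, valC                # PC update: PC <- valC

def ret(regs, mem, pc):
    """ret: pop the return address into the PC."""
    regs = dict(regs)
    valA = valB = regs[RSP]               # Decode
    valE = (valB + 8) & MASK              # Execute: increment stack pointer
    valM = mem[valA]                      # Memory: read the return address
    regs[RSP] = valE                      # Write back
    return regs, mem, valM                # PC update: PC <- valM
```

The only structural difference from pushq and popq is the source and destination of the value that moves through memory: valP is pushed instead of valA, and valM is routed to the PC instead of to a program register.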


Aside  Tracing the execution of a ret instruction

Let us trace the processing of the ret instruction on line 13 of the object code shown in Figure 4.17.
The instruction address is 0x041 and is encoded by a single byte 0x90. The previous call instruction
set %rsp to 120 and stored the return address 0x040 at memory address 120. The stages would proceed
as follows:

             Generic                   Specific
Stage        ret                       ret
Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[0x041] = 9:0
             valP ← PC + 1             valP ← 0x041 + 1 = 0x042
Decode       valA ← R[%rsp]            valA ← R[%rsp] = 120
             valB ← R[%rsp]            valB ← R[%rsp] = 120
Execute      valE ← valB + 8           valE ← 120 + 8 = 128
Memory       valM ← M8[valA]           valM ← M8[120] = 0x040
Write back   R[%rsp] ← valE            R[%rsp] ← 128
PC update    PC ← valM                 PC ← 0x040

As this trace shows, the instruction has the effect of setting the PC to 0x040, the address of the
halt instruction. It also sets %rsp to 128.


Practice Problem 4.18
Fill in the right-hand column of the following table to describe the processing of
the call instruction on line 9 of the object code in Figure 4.17:

             Generic                   Specific
Stage        call Dest                 call 0x041
Fetch        icode:ifun ← M1[PC]
             valC ← M8[PC + 1]
             valP ← PC + 9
Decode       valB ← R[%rsp]
Execute      valE ← valB + (-8)
Memory       M8[valE] ← valP
Write back   R[%rsp] ← valE
PC update    PC ← valC

What effect would this instruction execution have on the registers, the PC,
and the memory?


We have created a uniform framework that handles all of the different types of 
Y86-64 instructions. Even though the instructions have widely varying behavior, 
we can organize the processing into six stages. Our task now is to create a hardware 
design that implements the stages and connects them together. 
4.3.2 SEQ Hardware Structure


The computations required to implement all of the Y86-64 instructions can be organized
as a series of six basic stages: fetch, decode, execute, memory, write back,
and PC update. Figure 4.22 shows an abstract view of a hardware structure that can 
perform these computations. The program counter is stored in a register, shown 
in the lower left-hand corner (labeled “PC”). Information then flows along wires 
(shown grouped together as a heavy gray line), first upward and then around to 
the right. Processing is performed by hardware units associated with the different 
stages. The feedback paths coming back down on the right-hand side contain the 
updated values to write to the register file and the updated program counter. In 
SEQ, all of the processing by the hardware units occurs within a single clock cycle, 
as is discussed in Section 4.3.3. This diagram omits some small blocks of combi- 
national logic as well as all of the control logic needed to operate the different 
hardware units and to route the appropriate values to the units. We will add this 
detail later. Our method of drawing processors with the flow going from bottom 
to top is unconventional. We will explain the reason for this convention when we 
start designing pipelined processors.

The hardware units are associated with the different processing stages:

Fetch. Using the program counter register as an address, the instruction memory
        reads the bytes of an instruction. The PC incrementer computes valP,
        the incremented program counter.

Decode. The register file has two read ports, A and B, via which register values
        valA and valB are read simultaneously.

Execute. The execute stage uses the arithmetic/logic unit (ALU) for different
        purposes according to the instruction type. For integer operations, it
        performs the specified operation. For other instructions, it serves as an adder
        to compute an incremented or decremented stack pointer, to compute
        an effective address, or simply to pass one of its inputs to its outputs by
        adding zero.

        The condition code register (CC) holds the three condition code bits.
        New values for the condition codes are computed by the ALU. When
        executing a conditional move instruction, the decision as to whether or
        not to update the destination register is computed based on the condition
        codes and move condition. Similarly, when executing a jump instruction,
        the branch signal Cnd is computed based on the condition codes and the
        jump type.

Memory. The data memory reads or writes a word of memory when executing a
        memory instruction. The instruction and data memories access the same
        memory locations, but for different purposes.

Write back. The register file has two write ports. Port E is used to write values
        computed by the ALU, while port M is used to write values read from the
        data memory.

PC update. The new value of the program counter is selected to be either
        valP, the address of the next instruction, valC, the destination address
        specified by a call or jump instruction, or valM, the return address read
        from memory.

Figure 4.22 Abstract view of SEQ, a sequential implementation. The information
processed during execution of an instruction follows a clockwise flow starting with an
instruction fetch using the program counter (PC), shown in the lower left-hand corner
of the figure.


Figure 4.23 gives a more detailed view of the hardware required to implement 
SEQ (although we will not see the complete details until we examine the individual
stages). We see the same set of hardware units as earlier, but now the wires are 
shown explicitly. In this figure, as well as in our other hardware diagrams, we use 
the following drawing conventions: 


* Clocked registers are shown as white rectangles. The program counter PC is the 
only clocked register in SEQ. 

* Hardware units are shown as light blue boxes. These include the memories, 
the ALU, and so forth. We will use the same basic set of units for all of our 
processor implementations. We will treat these units as “black boxes" and not 
go into their detailed designs. 

* Control logic blocks are drawn as gray rounded rectangles. These blocks serve 
to select from among a set of signal sources or to compute some Boolean func- 
tion. We will examine these blocks in complete detail, including developing 
HCL descriptions. 

* Wire names are indicated in white circles. These are simply labels on the wires, 
not any kind of hardware element. 

* Word-wide data connections are shown as medium lines. Each of these lines 
actually represents a bundle of 64 wires, connected in parallel, for transferring 
a word from one part of the hardware to another. 


* Byte and narrower data connections are shown as thin lines. Each of these lines
actually represents a bundle of four or eight wires, depending on what type of 
values must be carried on the wires. 

* Single-bit connections are shown as dotted lines. These represent control values
passed between the units and blocks on the chip.


All of the computations we have shown in Figures 4.18 through 4.21 have the 
property that each line represents either the computation of a specific value, such 
as valP, or the activation of some hardware unit, such as the memory. These com-
putations and actions are listed in the second column of Figure 4.24. In addition 
to the signals we have already described, this list includes four register ID signals: 
srcA, the source of valA; srcB, the source of valB; dstE, the register to which valE 
gets written; and dstM, the register to which valM gets written. 

The two right-hand columns of this figure show the computations for the 
OPq and mrmovq instructions to illustrate the values being computed. To map the 
computations into hardware, we want to implement control logic that will transfer 
the data between the different hardware units and operate these units in such a way 
that the specified operations are performed for each of the different instruction 
types. That is the purpose of the control logic blocks, shown as gray rounded boxes 
in Figure 4.23. Our task is to proceed through the individual stages and create
detailed designs for these blocks.

Figure 4.23 Hardware structure of SEQ, a sequential implementation. Some of the
control signals, as well as the register and control word connections, are not shown.

Stage        Computation     OPq rA, rB                mrmovq D(rB), rA
Fetch        icode, ifun     icode:ifun ← M1[PC]       icode:ifun ← M1[PC]
             rA, rB          rA:rB ← M1[PC + 1]        rA:rB ← M1[PC + 1]
             valC                                      valC ← M8[PC + 2]
             valP            valP ← PC + 2             valP ← PC + 10
Decode       valA, srcA      valA ← R[rA]
             valB, srcB      valB ← R[rB]              valB ← R[rB]
Execute      valE            valE ← valB OP valA       valE ← valB + valC
             Cond. codes     Set CC
Memory       Read/write                                valM ← M8[valE]
Write back   E port, dstE    R[rB] ← valE
             M port, dstM                              R[rA] ← valM
PC update    PC              PC ← valP                 PC ← valP

Figure 4.24 Identifying the different computation steps in the sequential
implementation. The second column identifies the value being computed or the operation
being performed in the stages of SEQ. The computations for instructions OPq and mrmovq
are shown as examples of the computations.






4.3.3 SEQ Timing 


In introducing the tables of Figures 4.18 through 4.21, we stated that they should
be read as if they were written in a programming notation, with the assignments
performed in sequence from top to bottom. On the other hand, the hardware
structure of Figure 4.23 operates in a fundamentally different way, with a single
clock transition triggering a flow through combinational logic to execute an entire
instruction. Let us see how the hardware can implement the behavior listed in
these tables.

Our implementation of SEQ consists of combinational logic and two forms 
of memory devices: clocked registers (the program counter and condition code 
register) and random access memories (the register file, the instruction memory, 
and the data memory). Combinational logic does not require any sequencing 
or control—values propagate through a network of logic gates whenever the 
inputs change. As we have described, we also assume that reading from a random 
access memory operates much like combinational logic, with the output word 
generated based on the address input. This is a reasonable assumption for smaller 
memories (such as the register file), and we can mimic this effect for larger circuits
using special clock circuits. Since our instruction memory is only used to read 
instructions, we can therefore treat this unit as if it were combinational logic. 

We are left with just four hardware units that require an explicit control
over their sequencing—the program counter, the condition code register, the data 
memory, and the register file. These are controlled via a single clock signal that 
triggers the loading of new values into the registers and the writing of values to the 
random access memories. The program counter is loaded with a new instruction 
address every clock cycle. The condition code register is loaded only when an 
integer operation instruction is executed. The data memory is written only when 
an rmmovq, pushq, or call instruction is executed. The two write ports of the 
register file allow two program registers to be updated on every cycle, but we can 
use the special register ID 0xF as a port address to indicate that no write should
be performed for this port. 

This clocking of the registers and memories is all that is required to control the 
sequencing of activities in our processor. Our hardware achieves the same effect as 
would a sequential execution of the assignments shown in the tables of Figures 4.18 
through 4.21, even though all of the state updates actually occur simultaneously 
and only as the clock rises to start the next cycle. This equivalence holds because 
of the nature of the Y86-64 instruction set, and because we have organized the 
computations in such a way that our design obeys the following principle: 


PRINCIPLE: No reading back 


The processor never needs to read back the state updated by an instruction in 
order to complete the processing of this instruction.


This principle is crucial to the success of our implementation. As an illustra- 
tion, suppose we implemented the pushq instruction by first decrementing %rsp 
by 8 and then using the updated value of %rsp as the address of a write operation. 
This approach would violate the principle stated above. It would require reading 
the updated stack pointer from the register file in order to perform the memory 
operation. Instead, our implementation (Figure 4.20) generates the decremented 
value of the stack pointer as the signal valE and then uses this signal both as the 
data for the register write and the address for the memory write. As a result, it 
can perform the register and memory writes simultaneously as the clock rises to 
begin the next clock cycle. 

As another illustration of this principle, we can see that some instructions (the 
integer operations) set the condition codes, and some instructions (the conditional 
move and jump instructions) read these condition codes, but no instruction must 
both set and then read the condition codes. Even though the condition codes are 
not set until the clock rises to begin the next clock cycle, they will be updated 
before any instruction attempts to read them. 
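
The "no reading back" principle can be illustrated with a two-phase sketch of pushq: a combinational phase that derives every pending update from the current state only, and a clock-edge phase that applies all of the updates at once. The function names pushq_next_state and clock_edge, and the dict-based state, are hypothetical and serve only to illustrate the idea.

```python
# Illustrative two-phase model of pushq (names are assumptions).
MASK = (1 << 64) - 1
RSP = 4                                   # register ID of %rsp

def pushq_next_state(regs, mem, pc, rA):
    """Combinational phase: compute pending updates from the current state."""
    valE = (regs[RSP] - 8) & MASK         # decremented stack pointer
    return {
        "reg_writes": {RSP: valE},        # valE is the data for the register write ...
        "mem_writes": {valE: regs[rA]},   # ... and the address for the memory write
        "pc": pc + 2,
    }

def clock_edge(regs, mem, updates):
    """Rising clock edge: all state elements are updated together."""
    new_regs = {**regs, **updates["reg_writes"]}
    new_mem = {**mem, **updates["mem_writes"]}
    return new_regs, new_mem, updates["pc"]
```

Because pushq_next_state never consults the updated value of %rsp, the register write and the memory write do not depend on each other and can be committed in the same clock_edge call.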

Figure 4.25 shows how the SEQ hardware would process the instructions at 
lines 3 and 4 in the following code sequence, shown in assembly code with the 
instruction addresses listed on the left: 
0x000: irmovq $0x100,%rbx    # %rbx <-- 0x100
0x00a: irmovq $0x200,%rdx    # %rdx <-- 0x200
0x014: addq %rdx,%rbx        # %rbx <-- 0x300 CC <-- 000
0x016: je dest               # Not taken
0x01f: rmmovq %rbx,0(%rdx)   # M[0x200] <-- 0x300
0x029: dest: halt


Each of the diagrams labeled 1 through 4 shows the four state elements plus 
the combinational logic and the connections among the state elements. We show 
the combinational logic as being wrapped around the condition code register, 
because some of the combinational logic (such as the ALU) generates the input 
to the condition code register, while other parts (such as the branch computation 
and the PC selection logic) have the condition code register as input. We show the 
register file and the data memory as having separate connections for reading and 
writing, since the read operations propagate through these units as if they were 
combinational logic, while the write operations are controlled by the clock. 

The color coding in Figure 4.25 indicates how the circuit signals relate to the 
different instructions being executed. We assume the processing starts with the 
condition codes, listed in the order ZF, SF, and OF, set to 100. At the beginning of 
clock cycle 3 (point 1), the state elements hold the state as updated by the second 
irmovq instruction (line 2 of the listing), shown in light gray. The combinational 
logic is shown in white, indicating that it has not yet had time to react to the 
changed state. The clock cycle begins with address 0x014 loaded into the program 
counter. This causes the addq instruction (line 3 of the listing), shown in blue, to 
be fetched and processed. Values flow through the combinational logic, including 
the reading of the random access memories. By the end of the cycle (point 2), 
the combinational logic has generated new values (000) for the condition codes, 
an update for program register %rbx, and a new value (0x016) for the program 
counter. At this point, the combinational logic has been updated according to the 
addq instruction (shown in blue), but the state still holds the values set by the
second irmovq instruction (shown in light gray). 

As the clock rises to begin cycle 4 (point 3), the updates to the program
counter, the register file, and the condition code register occur, and so we show
these in blue, but the combinational logic has not yet reacted to these changes, and
so we show this in white. In this cycle, the je instruction (line 4 in the listing), shown
in dark gray, is fetched and executed. Since condition code ZF is 0, the branch is not 
taken. By the end of the cycle (point 4), a new value of 0x01f has been generated 
for the program counter. The combinational logic has been updated according to 
the je instruction (shown in dark gray), but the state still holds the values set by 
the addq instruction (shown in blue) until the next cycle begins. 

As this example illustrates, the use of a clock to control the updating of the
state elements, combined with the propagation of values through combinational 
logic, suffices to control the computations performed for each instruction in our 
implementation of SEQ. Every time the clock transitions from low to high, the 
processor begins executing a new instruction. 
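
The clocking behavior traced above can be mimicked with a short loop in which each iteration is one clock cycle: combinational logic computes new values from the current state, and the state elements then load those values together. The sketch below is a simplification under stated assumptions: the instructions are pre-decoded into tuples in a hypothetical PROGRAM table, only the first four instructions of the code sequence are modeled, and the names combinational and run are illustrative.

```python
# Illustrative clocked model of the first four instructions (names are assumptions).
MASK = (1 << 64) - 1

PROGRAM = {                               # address -> pre-decoded instruction
    0x000: ("irmovq", 0x100, "rbx"),
    0x00a: ("irmovq", 0x200, "rdx"),
    0x014: ("addq", "rdx", "rbx"),
    0x016: ("je", 0x029),
}

def to_signed(x):
    """Interpret a 64-bit pattern as a two's-complement integer."""
    return x - (1 << 64) if x >> 63 else x

def combinational(pc, cc, regs):
    """Compute next-state values from the current state without changing it."""
    op, *args = PROGRAM[pc]
    reg_write = None
    if op == "irmovq":
        valC, rB = args
        reg_write, new_cc, new_pc = (rB, valC), cc, pc + 10
    elif op == "addq":
        rA, rB = args
        exact = to_signed(regs[rB]) + to_signed(regs[rA])
        valE = exact & MASK
        new_cc = (int(valE == 0), valE >> 63, int(to_signed(valE) != exact))
        reg_write, new_pc = (rB, valE), pc + 2
    elif op == "je":
        (valC,) = args
        zf = cc[0]                        # reads the codes set by an earlier instruction
        new_cc, new_pc = cc, (valC if zf else pc + 9)
    else:
        raise ValueError(op)
    return new_pc, new_cc, reg_write

def run(cycles):
    """Return the state (pc, cc, regs) held after each clock cycle."""
    pc, cc, regs = 0x000, (1, 0, 0), {}   # condition codes ZF, SF, OF start at 100
    trace = []
    for _ in range(cycles):
        new_pc, new_cc, reg_write = combinational(pc, cc, regs)
        # Rising clock edge: PC, condition codes, and register file load together
        pc, cc = new_pc, new_cc
        if reg_write is not None:
            regs = {**regs, reg_write[0]: reg_write[1]}
        trace.append((pc, cc, dict(regs)))
    return trace
```

Calling run(4) yields one state per cycle; the third and fourth entries correspond to the states loaded at the end of cycles 3 and 4 in the discussion above.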
Figure 4.25 Tracing two cycles of execution by SEQ. Each cycle begins with the state
elements (program counter, condition code register, register file, and data memory)
set according to the previous instruction. Signals propagate through the combinational
logic, creating new values for the state elements. These values are loaded into the state
elements to start the next cycle.
4.3.4 SEQ Stage Implementations


In this section, we devise HCL descriptions for the control logic blocks required
to implement SEQ. A complete HCL description for SEQ is given in Web Aside 
ARCH:HCL on page 472. We show some example blocks here, and others are given as 
practice problems. We recommend that you work these problems as a way to check 
your understanding of how the blocks relate to the computational requirements 
of the different instructions. 

Name       Meaning
IHALT      Code for halt instruction
INOP       Code for nop instruction
IRRMOVQ    Code for rrmovq instruction
IIRMOVQ    Code for irmovq instruction
IRMMOVQ    Code for rmmovq instruction
IMRMOVQ    Code for mrmovq instruction
IOPL       Code for integer operation instructions
IJXX       Code for jump instructions
ICALL      Code for call instruction
IRET       Code for ret instruction
IPUSHQ     Code for pushq instruction
IPOPQ      Code for popq instruction

FNONE      Default function code

RRSP       Register ID for %rsp
RNONE      Indicates no register file access

ALUADD     Function for addition operation

SAOK       Status code for normal operation
SADR       Status code for address exception
SINS       Status code for illegal instruction exception
SHLT       Status code for halt

Figure 4.26 Constant values used in HCL descriptions. These values represent the
encodings of the instructions, function codes, register IDs, ALU operations, and status
codes.

Figure 4.27 SEQ fetch stage. Ten bytes are read from the instruction memory using
the PC as the starting address. From these bytes, we generate the different
instruction fields (icode, ifun, rA, rB, and valC). The PC increment block computes
signal valP.

Part of the HCL description of SEQ that we do not include here is a definition
of the different integer and Boolean signals that can be used as arguments to the
HCL operations. These include the names of the different hardware signals, as
well as constant values for the different instruction codes, function codes, register
names, ALU operations, and status codes. Only those that must be explicitly
referenced in the control logic are shown. The constants we use are documented 
in Figure 4.26. By convention, we use uppercase names for constant values. 

In addition to the instructions shown in Figures 4.18 to 4.21, we include the 
processing for the nop and halt instructions. The nop instruction simply flows 
through stages without much processing, except to increment the PC by 1. The 
halt instruction causes the processor status to be set to HLT, causing it to halt 
operation. 


Fetch Stage 


As shown in Figure 4.27, the fetch stage includes the instruction memory hardware 
unit. This unit reads 10 bytes from memory at a time, using the PC as the address of 
the first byte (byte 0). This byte is interpreted as the instruction byte and is split (by 
the unit labeled “Split”) into two 4-bit quantities. The control logic blocks labeled 
"icode" and "ifun" then compute the instruction and function codes as equaling 
either the values read from memory or, in the event that the instruction address 
is not valid (as indicated by the signal imem_error), the values corresponding to 
a nop instruction. Based on the value of icode, we can compute three 1-bit signals 
(shown as dashed lines): 


instr_valid. Does this byte correspond to a legal Y86-64 instruction? This signal 
is used to detect an illegal instruction. 

need_regids. Does this instruction include a register specifier byte? 

need_valC. Does this instruction include a constant word? 


The signals instr_valid and imem_error (generated when the instruction address 
is out of bounds) are used to generate the status code in the memory stage. 

As an example, the HCL description for need_regids simply determines 
whether the value of icode is one of the instructions that has a register 
specifier byte: 


bool need_regids = 
    icode in { IRRMOVQ, IOPQ, IPUSHQ, IPOPQ, 
               IIRMOVQ, IRMMOVQ, IMRMOVQ }; 





Write HCL code for the signal need_valC in the SEQ implementation. 


As Figure 4.27 shows, the remaining 9 bytes read from the instruction memory 
encode some combination of the register specifier byte and the constant word. 
These bytes are processed by the hardware unit labeled "Align" into the register 
fields and the constant word. Byte 1 is split into register specifiers rA and rB when 
the computed signal need_regids is 1. If need_regids is 0, both register specifiers 
are set to 0xF (RNONE), indicating there are no registers specified by this instruction. 
Recall also (Figure 4.2) that for any instruction having only one register operand, 
the other field of the register specifier byte will be 0xF (RNONE). Thus, we can 
assume that the signals rA and rB either encode registers we want to access or 
indicate that register access is not required. The unit labeled "Align" also generates 
the constant word valC. This will either be bytes 1-8 or bytes 2-9, depending on 
the value of signal need_regids. 

The PC incrementer hardware unit generates the signal valP, based on the 
current value of the PC, and the two signals need_regids and need_valC. For PC 
value p, need_regids value r, and need_valC value i, the incrementer generates 
the value p + 1 + r + 8i. 
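
As a quick check of this formula, the following Python sketch computes valP for the four possible combinations of need_regids and need_valC. The function name pc_increment and the sample address 0x100 are chosen here for illustration only; they are not part of the SEQ design.

```python
def pc_increment(p: int, need_regids: bool, need_valC: bool) -> int:
    """Return valP = p + 1 + r + 8i for the instruction at address p."""
    r = 1 if need_regids else 0   # one register specifier byte, if present
    i = 1 if need_valC else 0     # one 8-byte constant word, if present
    return p + 1 + r + 8 * i


# Instruction lengths implied by the formula, starting from address 0x100
base = 0x100
for regids, valc, example in [
    (False, False, "nop"),
    (True, False, "an OPq instruction"),
    (False, True, "call"),
    (True, True, "irmovq"),
]:
    length = pc_increment(base, regids, valc) - base
    print(f"need_regids={regids!s:5} need_valC={valc!s:5} -> {length} bytes ({example})")
```

The four cases give instruction lengths of 1, 2, 9, and 10 bytes, which is why the fetch stage must read 10 bytes to cover the longest instruction.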


Decode and Write-Back Stages 





Figure 4.28 provides a detailed view of logic that implements both the decode 
and write-back stages in SEQ. These two stages are combined because they both 
access the register file. 

The register file has four ports. It supports up to two simultaneous reads (on 
ports A and B) and two simultaneous writes (on ports E and M). Each port has 
both an address connection and a data connection, where the address connection 
is a register ID, and the data connection is a set of 64 wires serving as either an 
output word (for a read port) or an input word (for a write port) of the register 
file. The two read ports have address inputs srcA and srcB, while the two write 
ports have address inputs dstE and dstM. The special identifier 0xF (RNONE) on an 
address port indicates that no register should be accessed. 

Figure 4.28 SEQ decode and write-back stage. The instruction fields are 
decoded to generate register identifiers for four addresses (two 
read and two write) used by the register file. The values read 
from the register file become the signals valA and valB. The two 
write-back values valE and valM serve as the data for the writes. 

The four blocks at the bottom of Figure 4.28 generate the four different 
register IDs for the register file, based on the instruction code icode, the register 
specifiers rA and rB, and possibly the condition signal Cnd computed in the execute 
stage. Register ID srcA indicates which register should be read to generate valA. 


The desired value depends on the instruction type, as shown in the first row for the 
decode stage in Figures 4.18 to 4.21. Combining all of these entries into a single 
computation gives the following HCL description of srcA (recall that RRSP is the 
register ID of %rsp): 


word srcA = [ 
    icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ } : rA; 
    icode in { IPOPQ, IRET } : RRSP; 
    1 : RNONE; # Don't need register 
]; 
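
One way to read an HCL case expression such as this is as a chain of tests evaluated from top to bottom, with the first matching case supplying the value and the final case (1) acting as the default. The Python sketch below mirrors that reading. The numeric constants follow the standard Y86-64 encodings but should be treated as assumptions of this sketch, and src_a is a hypothetical helper name.

```python
# Assumed encodings for this sketch (standard Y86-64 values)
IRRMOVQ, IRMMOVQ, IOPQ, IRET, IPUSHQ, IPOPQ = 0x2, 0x4, 0x6, 0x9, 0xA, 0xB
RRSP, RNONE = 0x4, 0xF


def src_a(icode: int, rA: int) -> int:
    """Select the register ID for read port A, testing cases in order."""
    if icode in {IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ}:
        return rA          # the instruction names its own source register
    if icode in {IPOPQ, IRET}:
        return RRSP        # stack operations read the stack pointer
    return RNONE           # default case: no register read needed


print(hex(src_a(IOPQ, 0x2)))    # source register comes from the rA field
print(hex(src_a(IRET, 0xF)))    # ret reads %rsp even though rA is RNONE
print(hex(src_a(0x0, 0xF)))     # halt (code 0) reads nothing
```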




Register ID dstE indicates the destination register for write port E, where the 
computed value valE is stored. This is shown in Figures 4.18 to 4.21 as the first 
step in the write-back stage. If we ignore for the moment the conditional move 
instructions, then we can combine the destination registers for all of the different 
instructions to give the following HCL description of dstE: 


# WARNING: Conditional move not implemented correctly here 
word dstE = [ 
    icode in { IRRMOVQ } : rB; 
    icode in { IIRMOVQ, IOPQ } : rB; 
    icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP; 
    1 : RNONE; # Don't write any register 
]; 


We will revisit this signal and how to implement conditional moves when we 
examine the execute stage. 





Register ID dstM indicates the destination register for write port M, where valM, 
the value read from memory, is stored. This is shown in Figures 4.18 to 4.21 as the 
second step in the write-back stage. Write HCL code for dstM. 


Only the popq instruction uses both register file write ports simultaneously. For 
the instruction popq %rsp, the same address will be used for both the E and M 
write ports, but with different data. To handle this conflict, we must establish a 
priority among the two write ports so that when both attempt to write the same 
register on the same cycle, only the write from the higher-priority port takes place. 
Which of the two ports should be given priority in order to implement the desired 
behavior, as determined in Practice Problem 4.8? 


Execute Stage 


The execute stage includes the arithmetic/logic unit (ALU). This unit performs 
the operation ADD, SUBTRACT, AND, or EXCLUSIVE-OR on inputs aluA and aluB based 
on the setting of the alufun signal. These data and control signals are generated 
by three control blocks, as diagrammed in Figure 4.29. The ALU output becomes 
the signal valE. 

In Figures 4.18 to 4.21, the ALU computation for each instruction is shown as 
the first step in the execute stage. The operands are listed with aluB first, followed 
by aluA to make sure the subq instruction subtracts valA from valB. We can see 
that the value of aluA can be valA, valC, or either -8 or +8, depending on the 
instruction type. We can therefore express the behavior of the control block that 
generates aluA as follows: 


word aluA = [ 
    icode in { IRRMOVQ, IOPQ } : valA; 
    icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ } : valC; 
    icode in { ICALL, IPUSHQ } : -8; 
    icode in { IRET, IPOPQ } : 8; 
    # Other instructions don't need ALU 
]; 


Figure 4.29 SEQ execute stage. The ALU either performs the operation for an integer 
operation instruction or acts as an adder. The condition code registers are set 
according to the ALU value. The condition code values are tested to determine 
whether a branch should be taken. 


Based on the first operand of the first step of the execute stage in Figures 4.18 to 
4.21, write an HCL description for the signal aluB in SEQ. 


Looking at the operations performed by the ALU in the execute stage, we 
can see that it is mostly used as an adder. For the OPq instructions, however, we 
want it to use the operation encoded in the ifun field of the instruction. We can 
therefore write the HCL description for the ALU control as follows: 


word alufun = [ 
    icode == IOPQ : ifun; 
    1 : ALUADD; 
]; 


The execute stage also includes the condition code register. Our ALU gen- 
erates the three signals on which the condition codes are based—zero, sign, and 
overflow—every time it operates. However, we only want to set the condition 
codes when an OPq instruction is executed. We therefore generate a signal set_cc 
that controls whether or not the condition code register should be updated: 


bool set_cc = icode in { IOPQ }; 
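
To see how the ALU function selection and the condition code gating fit together, consider the following Python sketch of the execute stage. It is a behavioral model under stated assumptions, not the book's hardware: the function codes 0 to 3 for add, subtract, AND, and exclusive-or and the instruction codes are assumed from the standard Y86-64 encoding, and the helper names to_signed, alu, and execute are hypothetical. The operands aluA and aluB are passed in directly rather than derived from the instruction.

```python
MASK64 = (1 << 64) - 1

# Assumed encodings for this sketch (standard Y86-64 values)
IOPQ, IPUSHQ = 0x6, 0xA
ALUADD, ALUSUB, ALUAND, ALUXOR = 0, 1, 2, 3


def to_signed(x: int) -> int:
    """Interpret the low 64 bits of x as a two's-complement value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def alu(alufun: int, aluA: int, aluB: int):
    """Return (valE, (ZF, SF, OF)) for aluB OP aluA on 64-bit words."""
    a, b = to_signed(aluA), to_signed(aluB)
    if alufun == ALUADD:
        exact = b + a
    elif alufun == ALUSUB:
        exact = b - a              # subtracts aluA from aluB
    elif alufun == ALUAND:
        exact = b & a
    elif alufun == ALUXOR:
        exact = b ^ a
    else:
        raise ValueError("unknown ALU function")
    valE = to_signed(exact)        # wrap the exact result to 64 bits
    # Overflow occurred exactly when wrapping changed the value
    return valE, (valE == 0, valE < 0, exact != valE)


def execute(icode: int, ifun: int, aluA: int, aluB: int, cc):
    """Model alufun and set_cc: only OPq picks the operation and updates cc."""
    alufun = ifun if icode == IOPQ else ALUADD
    set_cc = icode == IOPQ
    valE, new_cc = alu(alufun, aluA, aluB)
    return valE, (new_cc if set_cc else cc)


cc = (False, False, False)
valE, cc = execute(IOPQ, ALUSUB, 3, 3, cc)      # subtract equal operands
print(valE, cc)
valE, cc = execute(IPUSHQ, 0, -8, 0x200, cc)    # adder use: 0x200 + (-8)
print(hex(valE), cc)
```

In the second call the ALU still produces zero, sign, and overflow signals, but because set_cc is 0 the condition codes left by the earlier subtraction are preserved.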


The hardware unit labeled "cond" uses a combination of the condition codes 
and the function code to determine whether a conditional branch or data transfer 
should take place (Figure 4.3). It generates the Cnd signal used both for the setting 
of dstE with conditional moves and in the next PC logic for conditional branches. 
For other instructions, the Cnd signal may be set to either 1 or 0, depending on 
the instruction's function code and the setting of the condition codes, but it will 
be ignored by the control logic. We omit the detailed design of this unit. 


Practice Problem 4.24 
The conditional move instructions, abbreviated cmovXX, have instruction code 
IRRMOVQ. As Figure 4.28 shows, we can implement these instructions by making 
use of the Cnd signal, generated in the execute stage. Modify the HCL code for 
dstE to implement these instructions. 





Memory Stage 


The memory stage has the task of either reading or writing program data. As 
shown in Figure 4.30, two control blocks generate the values for the memory 
address and the memory input data (for write operations). Two other blocks 
generate the control signals indicating whether to perform a read or a write 
operation. When a read operation is performed, the data memory generates the 
value valM. 

Figure 4.30 SEQ memory stage. The data memory can either write or read memory 
values. The value read from memory forms the signal valM. 

The desired memory operation for each instruction type is shown in the 
memory stage of Figures 4.18 to 4.21. Observe that the address for memory reads 
and writes is always valE or valA. We can describe this block in HCL as follows: 


word mem_addr = [ 
    icode in { IRMMOVQ, IPUSHQ, ICALL, IMRMOVQ } : valE; 
    icode in { IPOPQ, IRET } : valA; 
    # Other instructions don't need address 
]; 


Practice Problem 
Looking at the memory operations for the different instructions shown in Figures 
4.18 to 4.21, we can see that the data for memory writes are always either 
valA or valP. Write HCL code for the signal mem_data in SEQ. 


We want to set the control signal mem_read only for instructions that read 
data from memory, as expressed by the following HCL code: 


bool mem_read = icode in { IMRMOVQ, IPOPQ, IRET }; 





We want to set the control signal mem_write only for instructions that write data 
to memory. Write HCL code for the signal mem_write in SEQ. 

Figure 4.31 SEQ PC update stage. The next value of the PC is selected from 
among the signals valC, valM, and valP, depending on the instruction code and 
the branch flag. 


A final function for the memory stage is to compute the status code Stat 
resulting from the instruction execution according to the values of icode, 
imem_error, and instr_valid generated in the fetch stage and the signal dmem_error 
generated by the data memory. 









Practice Problem 4.27 
Write HCL code for Stat, generating the four status codes SAOK, SADR, SINS, and 
SHLT (see Figure 4.26). 

PC Update Stage 


The final stage in SEQ generates the new value of the program counter (see Figure 
4.31). As the final steps in Figures 4.18 to 4.21 show, the new PC will be valC, valM, 
or valP, depending on the instruction type and whether or not a branch should be 
taken. This selection can be described in HCL as follows: 


word new_pc = [ 
    # Call. Use instruction constant 
    icode == ICALL : valC; 
    # Taken branch. Use instruction constant 
    icode == IJXX && Cnd : valC; 
    # Completion of RET instruction. Use value from stack 
    icode == IRET : valM; 
    # Default: Use incremented PC 
    1 : valP; 
]; 
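
The effect of the Cnd signal on this selection is easiest to see by trying a conditional jump both ways. The Python sketch below is a direct transcription of the case expression under stated assumptions: the instruction codes are taken from the standard Y86-64 encoding, and the function name new_pc_value and the sample addresses are hypothetical.

```python
# Assumed encodings for this sketch (standard Y86-64 values)
IOPQ, IJXX, ICALL, IRET = 0x6, 0x7, 0x8, 0x9


def new_pc_value(icode: int, Cnd: bool, valC: int, valM: int, valP: int) -> int:
    """Select the next PC, testing the cases in the same order as the HCL."""
    if icode == ICALL:
        return valC            # call: jump to the instruction constant
    if icode == IJXX and Cnd:
        return valC            # taken branch: jump to the instruction constant
    if icode == IRET:
        return valM            # ret: return address read from the stack
    return valP                # default: fall through to the next instruction


# A 9-byte jump at address 0x020 with target 0x040, so valP is 0x029
print(hex(new_pc_value(IJXX, True, 0x040, 0, 0x029)))    # branch taken
print(hex(new_pc_value(IJXX, False, 0x040, 0, 0x029)))   # branch not taken
print(hex(new_pc_value(IRET, False, 0, 0x013, 0x031)))   # return address from memory
```

A jump whose condition fails matches none of the first three cases, so it falls through to the default and continues at valP.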


Surveying SEQ 


We have now stepped through a complete design for a Y86-64 processor. We 
have seen that by organizing the steps required to execute each of the different 
instructions into a uniform flow, we can implement the entire processor with a 
small number of different hardware units and with a single clock to control the 
sequencing of computations. The control logic must then route the signals between 
these units and generate the proper control signals based on the instruction types 
and the branch conditions. 

The only problem with SEQ is that it is too slow. The clock must run slowly 
enough so that signals can propagate through all of the stages within a single 
cycle. As an example, consider the processing of a ret instruction. Starting with 
an updated program counter at the beginning of the clock cycle, the instruction 
must be read from the instruction memory, the stack pointer must be read from 
the register file, the ALU must increment the stack pointer by 8, and the return 
address must be read from the memory in order to determine the next value for 
the program counter. All of these must be completed by the end of the clock cycle. 

This style of implementation does not make very good use of our hardware 
units, since each unit is only active for a fraction of the total clock cycle. We will 
see that we can achieve much better performance by introducing pipelining. 


4.4 General Principles of Pipelining 


Before attempting to design a pipelined Y86-64 processor, let us consider some 
general properties and principles of pipelined systems. Such systems are familiar 
to anyone who has been through the serving line at a cafeteria or run a car through 
an automated car wash. In a pipelined system, the task to be performed is divided 
into a series of discrete stages. In a cafeteria, this involves supplying salad, a 
main dish, dessert, and beverage. In a car wash, this involves spraying water and 
soap, scrubbing, applying wax, and drying. Rather than having one customer run 
through the entire sequence from beginning to end before the next can begin, we 
allow multiple customers to proceed through the system at once. In a traditional 
cafeteria line, the customers maintain the same order in the pipeline and pass 
through all stages, even if they do not want some of the courses. In the case of 
the car wash, a new car is allowed to enter the spraying stage as the preceding 
car moves from the spraying stage to the scrubbing stage. In general, the cars 
must move through the system at the same rate to avoid having one car crash into 
the next. 

A key feature of pipelining is that it increases the throughput of the system 
(i.e., the number of customers served per unit time), but it may also slightly 
increase the latency (i.e., the time required to service an individual customer). For 
example, a customer in a cafeteria who only wants a dessert could pass through a 
nonpipelined system very quickly, stopping only at the dessert stage. A customer in 
a pipelined system who attempts to go directly to the dessert stage risks incurring 
the wrath of other customers. 


4.4.1 Computational Pipelines 


Shifting our focus to computational pipelines, the “customers” are instructions 
and the stages perform some portion of the instruction execution. Figure 4.32(a) 
shows an example of a simple nonpipelined hardware system. It consists of some 
logic that performs a computation, followed by a register to hold the results of this 
computation. A clock signal controls the loading of the register at some regular 
time interval. An example of such a system is the decoder in a compact disk (CD) 
player. The incoming signals are the bits read from the surface of the CD, and 
the logic decodes these to generate audio signals. The computational block in the 
figure is implemented as combinational logic, meaning that the signals will pass 
through a series of logic gates, with the outputs becoming some function of the 
inputs after some time delay. 

Figure 4.32 Unpipelined computation hardware. On each 320 ps cycle, the system 
spends 300 ps evaluating a combinational logic function and 20 ps storing 
the results in an output register. (a) Hardware: combinational logic (300 ps) 
followed by a register (20 ps); delay = 320 ps, throughput = 3.12 GIPS. 
(b) Pipeline diagram. 

In contemporary logic design, we measure circuit delays in units of picoseconds 
(abbreviated "ps"), or $10^{-12}$ seconds. In this example, we assume the 
combinational logic requires 300 ps, while the loading of the register requires 20 ps. 
Figure 4.32 shows a form of timing diagram known as a pipeline diagram. In this 
diagram, time flows from left to right. A series of instructions (here named I1, I2, 
and I3) are written from top to bottom. The solid rectangles indicate the times 
during which these instructions are executed. In this implementation, we must 
complete one instruction before beginning the next. Hence, the boxes do not over- 
lap one another vertically. The following formula gives the maximum rate at which 
we could operate the system: 


$$\text{Throughput} = \frac{1\ \text{instruction}}{(20 + 300)\ \text{picoseconds}} \cdot \frac{1{,}000\ \text{picoseconds}}{1\ \text{nanosecond}} \approx 3.12\ \text{GIPS}$$


We express throughput in units of giga-instructions per second (abbreviated 
GIPS), or billions of instructions per second. The total time required to perform 
a single instruction from beginning to end is known as the latency. In this system, 
the latency is 320 ps, the reciprocal of the throughput. 

Figure 4.33 Three-stage pipelined computation hardware. The computation is split 
into stages A, B, and C. On each 120 ps cycle, each instruction progresses through one 
stage. (a) Hardware: delay = 360 ps, throughput = 8.33 GIPS. (b) Pipeline diagram. 

Figure 4.34 Three-stage pipeline timing. The rising edge of the clock signal controls 
the movement of instructions from one pipeline stage to the next. 

Suppose we could divide the computation performed by our system into three 
stages, A, B, and C, where each requires 100 ps, as illustrated in Figure 4.33. Then 
we could put pipeline registers between the stages so that each instruction moves 
through the system in three steps, requiring three complete clock cycles from 
beginning to end. As the pipeline diagram in Figure 4.33 illustrates, we could allow 
I2 to enter stage A as soon as I1 moves from A to B, and so on. In steady state, all 
three stages would be active, with one instruction leaving and a new one entering 
the system every clock cycle. We can see this during the third clock cycle in the 
pipeline diagram where I1 is in stage C, I2 is in stage B, and I3 is in stage A. In 
this system, we could cycle the clocks every 100 + 20 = 120 picoseconds, giving 
a throughput of around 8.33 GIPS. Since processing a single instruction requires 
3 clock cycles, the latency of this pipeline is 3 × 120 = 360 ps. We have increased 
the throughput of the system by a factor of 8.33/3.12 = 2.67 at the expense of 
some added hardware and a slight increase in the latency (360/320 = 1.12). The 
increased latency is due to the time overhead of the added pipeline registers. 
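
The arithmetic behind these figures can be captured in a few lines. The following Python sketch is our own illustration (the function name pipeline_timing and its parameters are not from the text); it assumes, as in the examples, that the clock period is the slowest stage delay plus the 20 ps register delay, and that an instruction needs one cycle per stage.

```python
def pipeline_timing(stage_delays_ps, register_delay_ps=20):
    """Return (cycle time in ps, latency in ps, throughput in GIPS)."""
    cycle_ps = max(stage_delays_ps) + register_delay_ps
    latency_ps = cycle_ps * len(stage_delays_ps)   # one cycle per stage
    throughput_gips = 1000 / cycle_ps              # 1,000 ps per nanosecond
    return cycle_ps, latency_ps, throughput_gips


for name, stages in [("unpipelined", [300]), ("three-stage", [100, 100, 100])]:
    cycle, latency, gips = pipeline_timing(stages)
    print(f"{name}: cycle {cycle} ps, latency {latency} ps, {gips:.2f} GIPS")
```

Running it reproduces the 320 ps cycle and 3.12 GIPS of the unpipelined system and the 120 ps cycle, 360 ps latency, and 8.33 GIPS of the three-stage pipeline.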


4.4.2 A Detailed Look at Pipeline Operation 


To better understand how pipelining works, let us look in some detail at the timing 
and operation of pipeline computations. Figure 4.34 shows the pipeline diagram 
for the three-stage pipeline we have already looked at (Figure 4.33). The transfer 
of the instructions between pipeline stages is controlled by a clock signal, as shown 
above the pipeline diagram. Every 120 ps, this signal rises from 0 to 1, initiating 
the next set of pipeline stage evaluations. 

Figure 4.35 traces the circuit activity between times 240 and 360, as instruction 
I1 (shown in dark gray) propagates through stage C, I2 (shown in blue) 
propagates through stage B, and I3 (shown in light gray) propagates through stage 
A. Just before the rising clock at time 240 (point 1), the values computed in stage A 
for instruction I2 have reached the input of the first pipeline register, but its state 
and output remain set to those computed during stage A for instruction I1. The 
values computed in stage B for instruction I1 have reached the input of the second 
pipeline register. As the clock rises, these inputs are loaded into the pipeline 
registers, becoming the register outputs (point 2). In addition, the input to stage 
A is set to initiate the computation of instruction I3. The signals then propagate 
through the combinational logic for the different stages (point 3). As the curved 
wave fronts in the diagram at point 3 suggest, signals can propagate through different 
sections at different rates. Before time 360, the result values reach the inputs 
of the pipeline registers (point 4). When the clock rises at time 360, each of the 
instructions will have progressed through one pipeline stage. 

Figure 4.35 One clock cycle of pipeline operation. Just before the clock rises at 
time 240 (point 1), instructions I1 (shown in dark gray) and I2 (shown in blue) have 
completed stages B and A. After the clock rises, these instructions begin propagating 
through stages C and B, while instruction I3 (shown in light gray) begins propagating 
through stage A (points 2 and 3). Just before the clock rises again, the results for the 
instructions have propagated to the inputs of the pipeline registers (point 4). 

We can see from this detailed view of pipeline operation that slowing down 
the clock would not change the pipeline behavior. The signals propagate to the 
pipeline register inputs, but no change in the register states will occur until the 
clock rises. On the other hand, we could have disastrous effects if the clock 
were run too fast. The values would not have time to propagate through the 
combinational logic, and so the register inputs would not yet be valid when the 
clock rises. 

As with our discussion of the timing for the SEQ processor (Section 4.3.3), 
we see that the simple mechanism of having clocked registers between blocks of 
combinational logic suffices to control the flow of instructions in the pipeline. As 
the clock rises and falls repeatedly, the different instructions flow through the 
stages of the pipeline without interfering with one another. 


4.4.3 Limitations of Pipelining 


The example of Figure 4.33 shows an ideal pipelined system in which we are able 
to divide the computation into three independent stages, each requiring one-third 
of the time required by the original logic. Unfortunately, other factors often arise 
that diminish the effectiveness of pipelining. 




Nonuniform Partitioning 


Figure 4.36 shows a system in which we divide the computation into three stages 
as before, but the delays through the stages range from 50 to 150 ps. The sum of 
the delays through all of the stages remains 300 ps. However, the rate at which we 
can operate the clock is limited by the delay of the slowest stage. As the pipeline 
diagram in this figure shows, stage A will be idle (shown as a white box) for 
100 ps every clock cycle, while stage C will be idle for 50 ps every clock cycle. Only 
stage B will be continuously active. We must set the clock cycle to 150 + 20 = 170 
picoseconds, giving a throughput of 5.88 GIPS. In addition, the latency would 
increase to 510 ps due to the slower clock rate. 
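
The idle times follow directly from the gap between each stage's delay and that of the slowest stage. The Python sketch below works this out; the stage delays of 50, 150, and 100 ps for A, B, and C are inferred from the idle times quoted above, and the function name stage_idle_times is our own.

```python
REGISTER_DELAY_PS = 20


def stage_idle_times(stage_delays_ps):
    """Return how long each stage sits idle per cycle, waiting on the slowest stage."""
    slowest = max(stage_delays_ps.values())
    return {name: slowest - delay for name, delay in stage_delays_ps.items()}


delays = {"A": 50, "B": 150, "C": 100}   # inferred from the idle times in the text
cycle_ps = max(delays.values()) + REGISTER_DELAY_PS
latency_ps = cycle_ps * len(delays)

print(stage_idle_times(delays))
print(f"cycle {cycle_ps} ps, latency {latency_ps} ps, {1000 / cycle_ps:.2f} GIPS")
```

Although the total logic delay is still 300 ps, the 150 ps stage alone sets the 170 ps clock, so the throughput falls from the 8.33 GIPS of the balanced design to 5.88 GIPS.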

Devising a partitioning of the system computation into a series of stages 
having uniform delays can be a major challenge for hardware designers. Often, 
some of the hardware units in a processor, such as the ALU and the memories, 
cannot be subdivided into multiple units with shorter delay. This makes it difficult 
to create a set of balanced stages. We will not concern ourselves with this level of 
detail in designing our pipelined Y86-64 processor, but it is important to appreciate 
the role of timing optimization in actual system design. 

Figure 4.36 Limitations of pipelining due to nonuniform stage delays. The system 
throughput is limited by the speed of the slowest stage. (Delay = 510 ps, 
throughput = 5.88 GIPS.) 


Practice Problem 4.28 
Suppose we analyze the combinational logic of Figure 4.32 and determine that it 
can be separated into a sequence of six blocks, named A to F, having delays of 80, 
30, 60, 50, 70, and 10 ps, respectively, illustrated as follows: 


A (80 ps) -> B (30 ps) -> C (60 ps) -> D (50 ps) -> E (70 ps) -> F (10 ps) -> Reg (20 ps) 


We can create pipelined versions of this design by inserting pipeline registers 
between pairs of these blocks. Different combinations of pipeline depth (how 
many stages) and maximum throughput arise, depending on where we insert the 
pipeline registers. Assume that a pipeline register has a delay of 20 ps. 


A. Inserting a single register gives a two-stage pipeline. Where should the 
register be inserted to maximize throughput? What would be the throughput 
and latency? 







B. Where should two registers be inserted to maximize the throughput of a 
three-stage pipeline? What would be the throughput and latency? 


C. Where should three registers be inserted to maximize the throughput of a 
4-stage pipeline? What would be the throughput and latency? 


D. What is the minimum number of stages that would yield a design with the 
maximum achievable throughput? Describe this design, its throughput, and 
its latency. 




Diminishing Returns of Deep Pipelining 


Figure 4.37 illustrates another limitation of pipelining. In this example, we have 
divided the computation into six stages, each requiring 50 ps. Inserting a pipeline 
register between each pair of stages yields a six-stage pipeline. The minimum 
clock period for this system is 50 + 20 = 70 picoseconds, giving a throughput of 
14.29 GIPS. Thus, in doubling the number of pipeline stages, we improve the 
performance by a factor of 14.29/8.33 = 1.71. Even though we have cut the time 
required for each computation block by a factor of 2, we do not get a doubling of 
the throughput, due to the delay through the pipeline registers. This delay becomes 
a limiting factor in the throughput of the pipeline. In our new design, this delay 
consumes 28.6% of the total clock period. 
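
A short calculation makes the growing share of register overhead explicit. The Python sketch below compares only the three-stage and six-stage designs discussed in the text; the function name register_overhead is our own, and the 20 ps register delay is the value assumed throughout this section.

```python
REGISTER_DELAY_PS = 20


def register_overhead(stage_delay_ps, register_delay_ps=REGISTER_DELAY_PS):
    """Fraction of each clock period spent loading the pipeline register."""
    return register_delay_ps / (stage_delay_ps + register_delay_ps)


designs = {"three-stage": 100, "six-stage": 50}   # logic delay per stage, in ps
throughput = {}
for name, stage_delay in designs.items():
    cycle_ps = stage_delay + REGISTER_DELAY_PS
    throughput[name] = 1000 / cycle_ps            # GIPS
    print(f"{name}: cycle {cycle_ps} ps, {throughput[name]:.2f} GIPS, "
          f"register overhead {register_overhead(stage_delay):.1%}")

speedup = throughput["six-stage"] / throughput["three-stage"]
print(f"speedup from doubling the stages: {speedup:.2f}")
```

Halving the logic delay per stage shortens the cycle only from 120 ps to 70 ps, because the fixed 20 ps register delay does not shrink with it.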

Modern processors employ very deep pipelines (15 or more stages) in an 
attempt to maximize the processor clock rate. The processor architects divide the 
instruction execution into a large number of very simple steps so that each stage 
can have a very small delay. The circuit designers carefully design the pipeline 
registers to minimize their delay. The chip designers must also carefully design the 
clock distribution network to ensure that the clock changes at the exact same time 
across the entire chip. All of these factors contribute to the challenge of designing 
high-speed microprocessors. 


Practice Problem 4.29 
Suppose we could take the system of Figure 4.32 and divide it into an arbitrary 
number of pipeline stages k, each having a delay of 300/k, and with each pipeline 
register having a delay of 20 ps. 

A. What would be the latency and the throughput of the system, as functions 
of k? 


B. What would be the ultimate limit on the throughput? 


Figure 4.37 Limitations of pipelining due to overhead. As the combinational logic is 
split into shorter blocks, the delay due to register updating becomes a limiting factor. 
(Delay = 420 ps, throughput = 14.29 GIPS.) 

4.4.4 Pipelining a System with Feedback 


Up to this point, we have considered only systems in which the objects passing 
through the pipeline—whether cars, people, or instructions—are completely in- 
dependent of one another. For a system that executes machine programs such as 
x86-64 or Y86-64, however, there are potential dependencies between successive 
instructions. For example, consider the following Y86-64 instruction sequence: 


1 irmovq $50, %rax 
2 addq %rax, %rbx 
3 mrmovq 100(%rbx), %rdx 


In this three-instruction sequence, there is a data dependency between each 
successive pair of instructions, as indicated by the circled register names and the 
arrows between them. The irmovq instruction (line 1) stores its result in %rax, 
which then must be read by the addq instruction (line 2); and this instruction stores 
its result in %rbx, which must then be read by the mrmovq instruction (line 3). 

Another source of sequential dependencies occurs due to the instruction 
control flow. Consider the following Y86-64 instruction sequence: 


1 loop: 
2 subq %rdx, %rbx 
3 jne targ 
4 irmovq $10,%rdx 
5 jmp loop 
6 targ: 
7 halt 


The jne instruction (line 3) creates a control dependency since the outcome of 
the conditional test determines whether the next instruction to execute will be the 
irmovq instruction (line 4) or the halt instruction (line 7). In our design for SEQ, 
these dependencies were handled by the feedback paths shown on the right-hand 
side of Figure 4.22. This feedback brings the updated register values down to the 
register file and the new PC value down to the PC register. 

Figure 4.38 illustrates the perils of introducing pipelining into a system con-
taining feedback paths. In the original system (Figure 4.38(a)), the result of each
instruction is fed back around to the next instruction. This is illustrated by the
pipeline diagram (Figure 4.38(b)), where the result of I1 becomes an input to
I2, and so on. If we attempt to convert this to a three-stage pipeline in the most
straightforward manner (Figure 4.38(c)), we change the behavior of the system.
As Figure 4.38(d) shows, the result of I1 becomes an input to I4. In attempting to
speed up the system via pipelining, we have changed the system behavior.

Figure 4.38 Limitations of pipelining due to logical dependencies. In going from an
unpipelined system with feedback (a) to a pipelined one (c), we change its computational
behavior, as can be seen by the two pipeline diagrams (b and d).

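
The change in behavior can be reproduced with a small simulation. This is a sketch under simplifying assumptions: one instruction is issued per cycle, the feedback path carries the result of whichever instruction most recently left the last stage, and the function name feedback_sources is hypothetical.

```python
def feedback_sources(num_instructions, num_stages):
    """Map each instruction number to the instruction whose result is on
    the feedback path when it enters the first stage (None if no result
    has been produced yet)."""
    pipeline = [None] * num_stages   # pipeline[0] is the first stage
    feedback = None                  # producer of the value on the feedback path
    sources = {}
    for instr in range(1, num_instructions + 1):
        sources[instr] = feedback            # the new instruction reads the feedback
        pipeline = [instr] + pipeline[:-1]   # everything advances one stage
        if pipeline[-1] is not None:         # end of cycle: last stage produces a result
            feedback = pipeline[-1]
    return sources

print("unpipelined:", feedback_sources(4, 1))
print("three-stage:", feedback_sources(4, 3))
```

With one stage, each instruction sees the result of its immediate predecessor. With three stages, I2 and I3 see no result at all and I4 sees the result of I1, which is the altered behavior the pipeline diagram shows.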
When we introduce pipelining into a Y86-64 processor, we must deal with
feedback effects properly. Clearly, it would be unacceptable to alter the system 
behavior as occurred in the example of Figure 4.38. Somehow we must deal 
with the data and control dependencies between instructions so that the resulting 
behavior matches the model defined by the ISA. 


4.5 Pipelined Y86-64 Implementations


We are finally ready for the major task of this chapter: designing a pipelined
Y86-64 processor. We start by making a small adaptation of the sequential processor
SEQ to shift the computation of the PC into the fetch stage. We then add pipeline
registers between the stages. Our first attempt at this does not handle the different 
data and control dependencies properly. By making some modifications, however, 
we achieve our goal of an efficient pipelined processor that implements the Y86- 
64 ISA. 


4.5.1 SEQ+: Rearranging the Computation Stages 


As a transitional step toward a pipelined design, we must slightly rearrange the 
order of the five stages in SEQ so that the PC update stage comes at the beginning 
of the clock cycle, rather than at the end. This transformation requires only 
minimal change to the overall hardware structure, and it will work better with 
the sequencing of activities within the pipeline stages. We refer to this modified 
design as SEQ+. 

We can move the PC update stage so that its logic is active at the beginning of 
the clock cycle by making it compute the PC value for the current instruction. 
Figure 4.39 shows how SEQ and SEQ+ differ in their PC computation. With 
SEQ (Figure 4.39(a)), the PC computation takes place at the end of the clock 
cycle, computing the new value for the PC register based on the values of signals 
computed during the current clock cycle. With SEQ+ (Figure 4.39(b)), we create 
state registers to hold the signals computed during an instruction. Then, as a
new clock cycle begins, the values propagate through the exact same logic to
compute the PC for the now-current instruction. We label the registers “pIcode,”





Figure 4.39 Shifting the timing of the PC computation. With SEQ+, we compute
the value of the program counter for the current state as the first step in instruction
execution. Panels: (a) SEQ new PC computation; (b) SEQ+ PC selection. Inputs to
the PC logic: icode, Cnd, valC, valM, valP.

Aside Where is the PC in SEQ+?

One curious feature of SEQ+ is that there is no hardware register storing the program counter. Instead,
the PC is computed dynamically based on some state information stored from the previous instruction.
This is a small illustration of the fact that we can implement a processor in a way that differs from the
conceptual model implied by the ISA, as long as the processor correctly executes arbitrary machine-
language programs. We need not encode the state in the form indicated by the programmer-visible state,
as long as the processor can generate correct values for any part of the programmer-visible state (such
as the program counter). We will exploit this principle even more in creating a pipelined design. Out-
of-order processing techniques, as described in Section 5.7, take this idea to an extreme by executing
instructions in a completely different order than they occur in the machine-level program.

“pCnd,” and so on, to indicate that on any given cycle, they hold the control signals
generated during the previous cycle.

Figure 4.40 shows a more detailed view of the SEQ+ hardware. We can see 
that it contains the exact same hardware units and control blocks that we had in 
SEQ (Figure 4.23), but with the PC logic shifted from the top, where it was active 
at the end of the clock cycle, to the bottom, where it is active at the beginning. 

The shift of state elements from SEQ to SEQ+ is an example of a general
transformation known as circuit retiming [68]. Retiming changes the state
representation for a system without changing its logical behavior. It is often used to
balance the delays between the different stages of a pipelined system.
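
To make the retiming concrete, here is a minimal Python sketch contrasting the two ways of timing the PC computation. It assumes the usual PC selection rule (valC for call and taken jumps, valM for ret, valP otherwise); the symbolic instruction codes, the function names, the reset state, and the signal trace are all hypothetical values chosen for illustration.

```python
# Symbolic instruction codes (placeholders, not the real Y86-64 encodings).
ICALL, IJXX, IRET, IOTHER = "call", "jXX", "ret", "other"

def next_pc(icode, cnd, valC, valM, valP):
    """PC selection logic, used unchanged by both models below."""
    if icode == ICALL:
        return valC              # call: jump to the constant word
    if icode == IJXX and cnd:
        return valC              # taken conditional jump
    if icode == IRET:
        return valM              # ret: address read from memory
    return valP                  # everything else: fall through

def seq_fetch_addresses(trace, start_pc=0):
    """SEQ style: an explicit PC register, updated at the end of each cycle."""
    pc = start_pc
    fetched = []
    for signals in trace:
        fetched.append(pc)
        pc = next_pc(*signals)
    return fetched

def seq_plus_fetch_addresses(trace, start_pc=0):
    """SEQ+ style: no PC register; the PC is derived at the start of each
    cycle from the saved signals (pIcode, pCnd, pValC, pValM, pValP)."""
    saved = (IOTHER, False, 0, 0, start_pc)   # assumed reset state
    fetched = []
    for signals in trace:
        fetched.append(next_pc(*saved))
        saved = signals
    return fetched

# Hypothetical per-instruction signals: (icode, Cnd, valC, valM, valP).
trace = [
    (IOTHER, False, 0,     0,     0x00a),   # at 0x000, falls through
    (ICALL,  False, 0x020, 0,     0x013),   # at 0x00a, calls 0x020
    (IRET,   False, 0,     0x013, 0x021),   # at 0x020, returns to 0x013
    (IJXX,   False, 0x030, 0,     0x01c),   # at 0x013, jump not taken
    (IOTHER, False, 0,     0,     0x01d),   # at 0x01c
]
print([hex(a) for a in seq_plus_fetch_addresses(trace)])
```

Both functions apply the same next_pc logic and produce the same sequence of fetch addresses. Only the point in the cycle at which the logic runs, and therefore what is stored between cycles, differs.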


4.5.2 Inserting Pipeline Registers 


In our first attempt at creating a pipelined Y86-64 processor, we insert pipeline 
registers between the stages of SEQ+ and rearrange signals somewhat, yielding 
the PIPE— processor, where the “—” in the name signifies that this processor has 
somewhat less performance than our ultimate processor design. The structure of 
PIPE- is illustrated in Figure 4.41. The pipeline registers are shown in this figure 
as blue boxes, each containing different fields that are shown as white boxes. As 
indicated by the multiple fields, each pipeline register holds multiple bytes and 
words. Unlike the labels shown in rounded boxes in the hardware structure of the 
two sequential processors (Figures 4.23 and 4.40), these white boxes represent 
actual hardware components. 

Observe that PIPE- uses nearly the same set of hardware units as our sequential
design SEQ+ (Figure 4.40), but with the pipeline registers separating the stages.
The differences between the signals in the two systems are discussed in Section 4.5.3.

The pipeline registers are labeled as follows: 


F holds a predicted value of the program counter, as will be discussed shortly.

D sits between the fetch and decode stages. It holds information about the most
recently fetched instruction for processing by the decode stage.

E sits between the decode and execute stages. It holds information about the
most recently decoded instruction and the values read from the register
file for processing by the execute stage.

M sits between the execute and memory stages. It holds the results of the
most recently executed instruction for processing by the memory stage.
It also holds information about branch conditions and branch targets for
processing conditional jumps.

W sits between the memory stage and the feedback paths that supply the
computed results to the register file for writing and the return address
to the PC selection logic when completing a ret instruction.

Figure 4.40 SEQ+ hardware structure. Shifting the PC computation from the end of
the clock cycle to the beginning makes it more suitable for pipelining.

Figure 4.41 Hardware structure of PIPE-, an initial pipelined implementation. By
inserting pipeline registers into SEQ+ (Figure 4.40), we create a five-stage pipeline. There
are several shortcomings of this version that we will deal with shortly.


Figure 4.42 shows how the following code sequence would flow through our
five-stage pipeline, where the comments identify the instructions as I1 to I5 for
reference:


irmovq $1,%rax    # I1
irmovq $2,%rbx    # I2
irmovq $3,%rcx    # I3
irmovq $4,%rdx    # I4
halt              # I5


The right side of the figure shows a pipeline diagram for this instruction
sequence. As with the pipeline diagrams for the simple pipelined computation
units of Section 4.4, this diagram shows the progression of each instruction through
the pipeline stages, with time increasing from left to right. The numbers along the
top identify the clock cycles at which the different stages occur. For example, in
cycle 1, instruction I1 is fetched, and it then proceeds through the pipeline stages,
with its result being written to the register file after the end of cycle 5. Instruction 
I2 is fetched in cycle 2, and its result is written back after the end of cycle 6, and 
so on. At the bottom, we show an expanded view of the pipeline for cycle 5. At 
this point, there is an instruction in each of the pipeline stages. 
From Figure 4.42, we can also justify our convention of drawing processors 
so that the instructions flow from bottom to top. The expanded view for cycle 5 
shows the pipeline stages with the fetch stage on the bottom and the write-back 
stage on the top, just as do our diagrams of the pipeline hardware (Figure 4.41). 
If we look at the ordering of instructions in the pipeline stages, we see that they 
appear in the same order as they do in the program listing. Since normal program 
flow goes from top to bottom of a listing, we preserve this ordering by having the 
pipeline flow go from bottom to top. This convention is particularly useful when 
working with the simulators that accompany this text. 
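
The pipeline diagram and the expanded view for cycle 5 can be regenerated with a few lines of Python. This is a sketch assuming an ideal five-stage pipeline with no stalls; the function names stage_of, diagram, and snapshot are made up for this example.

```python
STAGES = ["F", "D", "E", "M", "W"]

def stage_of(instr_index, cycle):
    """Stage occupied in `cycle` by the instruction fetched in cycle
    `instr_index` (both 1-based), or None if it is not in the pipeline."""
    pos = cycle - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None

def diagram(num_instructions):
    """One text row per instruction, one column per clock cycle."""
    last_cycle = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(1, num_instructions + 1):
        cells = [stage_of(i, c) or "." for c in range(1, last_cycle + 1)]
        rows.append(f"I{i}  " + " ".join(cells))
    return rows

def snapshot(num_instructions, cycle):
    """Occupant of each stage during `cycle`, listed from write-back
    (top of the hardware diagram) down to fetch (bottom)."""
    view = []
    for stage in reversed(STAGES):
        occupant = None
        for i in range(1, num_instructions + 1):
            if stage_of(i, cycle) == stage:
                occupant = f"I{i}"
        view.append((stage, occupant))
    return view

print("\n".join(diagram(5)))
for stage, occupant in snapshot(5, 5):
    print(stage, occupant)
```

Printing the cycle 5 snapshot from write-back down to fetch lists I1 through I5 in program order, which is the reason given above for drawing the pipeline with instructions flowing from bottom to top.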





4.5.3 Rearranging and Relabeling Signals 


Our sequential implementations SEQ and SEQ+ only process one instruction at 
a time, and so there are unique values for signals such as valC, srcA, and valE. In 
our pipelined design, there will be multiple versions of these values associated 
with the different instructions flowing through the system. For example, in the 
detailed structure of PIPE-, there are four white boxes labeled “Stat” that hold 
the status codes for four different instructions (see Figure 4.41). We need to take 
great care to make sure we use the proper version of a signal, or else we could 
have serious errors, such as storing the result computed for one instruction at the 
destination register specified by another instruction. We adopt a naming scheme 
where a signal stored in a pipeline register can be uniquely identified by prefixing 
its name with that of the pipe register written in uppercase. For example, the four
status codes are named D_stat, E_stat, M_stat, and W_stat. We also need to refer
to some signals that have just been computed within a stage. These are labeled
by prefixing the signal name with the first character of the stage name, written
in lowercase. Using the status codes as examples, we can see control logic blocks
labeled “Stat” in the fetch and memory stages. The outputs of these blocks are
therefore named f_stat and m_stat. We can also see that the actual status of the
overall processor Stat is computed by a block in the write-back stage, based on
the status value in pipeline register W.
The decode stages of SEQ+ and PIPE- both generate signals dstE and dstM
indicating the destination register for values valE and valM. In SEQ+, we could 
connect these signals directly to the address inputs of the register file write ports. 
With PIPE—, these signals are carried along in the pipeline through the execute 
and memory stages and are directed to the register file only once they reach 

Aside What is the difference between signals M_stat and m_stat?

With our naming system, the uppercase prefixes ‘D’, ‘E’, ‘M’, and ‘W’ refer to pipeline registers, and so
M_stat refers to the status code field of pipeline register M. The lowercase prefixes ‘f’, ‘d’, ‘e’, ‘m’, and
‘w’ refer to the pipeline stages, and so m_stat refers to the status signal generated in the memory stage
by a control logic block.

Understanding this naming convention is critical to understanding the operation of our pipelined
processors.



the write-back stage (shown in the more detailed views of the stages). We do 
this to make sure the write port address and data inputs hold values from the 
same instruction. Otherwise, the write back would be writing the values for the 
instruction in the write-back stage, but with register IDs from the instruction in 
the decode stage. As a general principle, we want to keep all of the information 
about a particular instruction contained within a single pipeline stage. 

One block of PIPE- that is not present in SEQ+ in the exact same form is the 
block labeled “Select A” in the decode stage. We can see that this block generates 
the value valA for the pipeline register E by choosing either valP from pipeline 
register D or the value read from the A port of the register file. This block is 
included to reduce the amount of state that must be carried forward to pipeline 
registers E and M. Of all the different instructions, only the call requires valP
in the memory stage. Only the jump instructions require the value of valP in the 
execute stage (in the event the jump is not taken). None of these instructions 
requires a value read from the register file. Therefore, we can reduce the amount 
of pipeline register state by merging these two signals and carrying them through 
the pipeline as a single signal valA. This eliminates the need for the block labeled 
“Data” in SEQ (Figure 4.23) and SEQ+ (Figure 4.40), which served a similar 
purpose. In hardware design, it is common to carefully identify how signals get 
used and then reduce the amount of register state and wiring by merging signals 
such as these. 
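
The merging performed by the “Select A” block amounts to a two-way choice. The sketch below uses the naming convention of Section 4.5.3 for its arguments; the function name select_a, the symbolic instruction codes, and the name d_rvalA for the value read from port A of the register file are assumptions made for illustration.

```python
# Symbolic instruction codes (placeholders, not the real Y86-64 encodings).
ICALL, IJXX, IOPQ = "call", "jXX", "OPq"

def select_a(D_icode, D_valP, d_rvalA):
    """Value forwarded to pipeline register E as valA.

    call and the jump instructions need valP later in the pipeline and
    never need a register value on port A, so one field serves both uses.
    """
    if D_icode in (ICALL, IJXX):
        return D_valP
    return d_rvalA

print(hex(select_a(ICALL, 0x013, 99)))   # call carries valP forward
print(select_a(IOPQ, 0x013, 99))         # other instructions carry the register value
```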

As shown in Figure 4.41, our pipeline registers include a field for the status 
code stat, initially computed during the fetch stage and possibly modified during 
the memory stage. We will discuss how to implement the processing of exceptional 
events in Section 4.5.6, after we have covered the implementation of normal in- 
struction execution. Suffice it to say at this point that the most systematic approach 
is to associate a status code with each instruction as it passes through the pipeline, 
as we have indicated in the figure. 


4.5.4 Next PC Prediction


We have taken some measures in the design of PIPE— to properly handle control 
dependencies. Our goal in the pipelined design is to issue a new instruction on 
every clock cycle, meaning that on each clock cycle, a new instruction proceeds 
into the execute stage and will ultimately be completed. Achieving this goal would 

Aside Other branch prediction strategies

Our design uses an always taken branch prediction strategy. Studies show this strategy has around a
60% success rate [44, 122]. Conversely, a never taken (NT) strategy has around a 40% success rate. A
slightly more sophisticated strategy, known as backward taken, forward not taken (BTFNT), predicts
that branches to lower addresses than the next instruction will be taken, while those to higher addresses
will not be taken. This strategy has a success rate of around 65%. This improvement stems from the fact
that loops are closed by backward branches and loops are generally executed multiple times. Forward
branches are used for conditional operations, and these are less likely to be taken. In Problems 4.55
and 4.56, you can modify the Y86-64 pipeline processor to implement the NT and BTFNT branch
prediction strategies.

As we saw in Section 3.6.6, mispredicted branches can degrade the performance of a program
considerably, thus motivating the use of conditional data transfer rather than conditional control
transfer when possible.



yield a throughput of one instruction per cycle. To do this, we must determine 
the location of the next instruction right after fetching the current instruction. 
Unfortunately, if the fetched instruction is a conditional branch, we will not
know whether or not the branch should be taken until several cycles later, after 
the instruction has passed through the execute stage. Similarly, if the fetched 
instruction is a ret, we cannot determine the return location until the instruction 
has passed through the memory stage. 

With the exception of conditional jump instructions and ret, we can deter- 
mine the address of the next instruction based on information computed during 
the fetch stage. For call and jmp (unconditional jump), it will be valC, the con-
stant word in the instruction, while for all others it will be valP, the address of the
next instruction. We can therefore achieve our goal of issuing a new instruction 
every clock cycle in most cases by predicting the next value of the PC. For most in- 
struction types, our prediction will be completely reliable. For conditional jumps, 
we can predict either that a jump will be taken, so that the new PC value would be 
valC, or that it will not be taken, so that the new PC value would be valP. In either 
case, we must somehow deal with the case where our prediction was incorrect and 
therefore we have fetched and partially executed the wrong instructions. We will 
return to this matter in Section 4.5.8. 

This technique of guessing the branch direction and then initiating the fetching 
of instructions according to our guess is known as branch prediction. It is used in
some form by virtually all processors. Extensive experiments have been conducted 
on effective strategies for predicting whether or not branches will be taken [46, 
Section 2.3]. Some systems devote large amounts of hardware to this task. In our 
design, we will use the simple strategy of predicting that conditional branches are 
always taken, and so we predict the new value of the PC to be valC. 
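
The simple static strategies differ only in which of valC and valP they choose for a conditional jump. The following Python sketch compares always taken, never taken, and backward taken, forward not taken; the function names and the example addresses are hypothetical.

```python
def always_taken(valC, valP):
    """Predict every conditional jump as taken (the strategy used in the text)."""
    return valC

def never_taken(valC, valP):
    """Predict every conditional jump as not taken."""
    return valP

def btfnt(valC, valP):
    """Backward taken, forward not taken: predict taken only when the
    target lies at a lower address than the next instruction."""
    return valC if valC < valP else valP

# Hypothetical conditional jump whose following instruction is at 0x01c.
backward = dict(valC=0x000, valP=0x01c)   # loop-closing branch
forward = dict(valC=0x030, valP=0x01c)    # branch that skips ahead

for name, strategy in [("AT", always_taken), ("NT", never_taken), ("BTFNT", btfnt)]:
    print(name, hex(strategy(**backward)), hex(strategy(**forward)))
```

None of these functions needs any run-time history, which is what makes them cheap to build. Whichever is chosen, the processor still needs a way to recover from a wrong guess.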

We are still left with predicting the new PC value resulting from a ret
instruction. Unlike conditional jumps, we have a nearly unbounded set of possible

Aside Return address prediction with a stack

With most programs, it is very easy to predict return addresses, since procedure calls and returns occur
in matched pairs. Most of the time that a procedure is called, it returns to the instruction following the
call. This property is exploited in high-performance processors by including a hardware stack within
the instruction fetch unit that holds the return address generated by procedure call instructions. Every
time a procedure call instruction is executed, its return address is pushed onto the stack. When a return
instruction is fetched, the top value is popped from this stack and used as the predicted return address.
Like branch prediction, a mechanism must be provided to recover when the prediction was incorrect,
since there are times when calls and returns do not match. In general, the prediction is highly reliable.
This hardware stack is not part of the programmer-visible state.

results, since the return address will be whatever word is on the top of the stack.
In our design, we will not attempt to predict any value for the return address. 
Instead, we will simply hold off processing any more instructions until the ret 
instruction passes through the write-back stage. We will return to this part of the 
implementation in Section 4.5.8. 

The PIPE- fetch stage, diagrammed at the bottom of Figure 4.41, is respon- 
sible for both predicting the next value of the PC and selecting the actual PC for 
the instruction fetch. We can see the block labeled “Predict PC” can choose either 
valP (as computed by the PC incrementer) or valC (from the fetched instruction). 
This value is stored in pipeline register F as the predicted value of the program 
counter. The block labeled “Select PC” is similar to the block labeled “PC” in the 
SEQ+ PC selection stage (Figure 4.40). It chooses one of three values to serve as 
the address for the instruction memory: the predicted PC, the value of valP for 
a not-taken branch instruction that reaches pipeline register M (stored in regis- 
ter M_valA), or the value of the return address when a ret instruction reaches 
pipeline register W (stored in W_valM).
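
The two fetch-stage blocks described above can be sketched as a pair of Python functions. The signal names follow the naming convention of Section 4.5.3; the function names, the symbolic instruction codes, and the priority given to the mispredicted-branch case over the ret case are assumptions of this sketch.

```python
# Symbolic instruction codes (placeholders, not the real Y86-64 encodings).
ICALL, IJXX, IRET, IOPQ = "call", "jXX", "ret", "OPq"

def predict_pc(f_icode, f_valC, f_valP):
    """'Predict PC': guess the address of the next instruction.
    Jumps (predicted taken) and calls go to valC; all others use valP."""
    return f_valC if f_icode in (IJXX, ICALL) else f_valP

def select_pc(F_predPC, M_icode, M_Cnd, M_valA, W_icode, W_valM):
    """'Select PC': choose the address actually sent to instruction memory."""
    if M_icode == IJXX and not M_Cnd:
        return M_valA        # mispredicted branch: fall-through address (valP)
    if W_icode == IRET:
        return W_valM        # ret has reached W: use the return address
    return F_predPC          # otherwise trust the prediction

# A jump at a hypothetical address is predicted taken ...
print(hex(predict_pc(IJXX, 0x030, 0x01c)))
# ... and corrected once it reaches pipeline register M with Cnd false.
print(hex(select_pc(0x030, IJXX, False, 0x01c, IOPQ, 0)))
```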


4.5.5 Pipeline Hazards 


Our structure PIPE— is a good start at creating a pipelined Y86-64 processor. 
Recall from our discussion in Section 4.4.4, however, that introducing pipelining 
into a system with feedback can lead to problems when there are dependencies 
between successive instructions. We must resolve this issue before we can com- 
plete our design. These dependencies can take two forms: (1) data dependencies, 
where the results computed by one instruction are used as the data for a following
instruction, and (2) control dependencies, where one instruction determines
the location of the following instruction, such as when executing a jump, call, or
return. When such dependencies have the potential to cause an erroneous com-
putation by the pipeline, they are called hazards. Like dependencies, hazards can
be classified as either data hazards or control hazards. We first concern ourselves
with data hazards and then consider control hazards.

# prog1
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
0x014: nop
0x015: nop
0x016: nop
0x017: addq %rdx,%rax
0x019: halt

Cycle 7, decode stage: valA ← R[%rdx] = 10, valB ← R[%rax] = 3

Figure 4.43 Pipelined execution of prog1 without special pipeline control. In cycle
6, the second irmovq writes its result to program register %rax. The addq instruction
reads its source operands in cycle 7, so it gets correct values for both %rdx and %rax.


Figure 4.43 illustrates the processing of a sequence of instructions we refer to
as prog1 by the PIPE- processor. Let us assume in this example and successive
ones that the program registers initially all have value 0. The code loads values
10 and 3 into program registers %rdx and %rax, executes three nop instructions,
and then adds register %rdx to %rax. We focus our attention on the potential data
hazards resulting from the data dependencies between the two irmovq instructions
and the addq instruction. On the right-hand side of the figure, we show a pipeline
diagram for the instruction sequence. The pipeline stages for cycles 6 and 7 are
shown highlighted in the pipeline diagram. Below this, we show an expanded view
of the write-back activity in cycle 6 and the decode activity during cycle 7. After
the start of cycle 7, both of the irmovq instructions have passed through the write-
back stage, and so the register file holds the updated values of %rdx and %rax.
As the addq instruction passes through the decode stage during cycle 7, it will
therefore read the correct values for its source operands. The data dependencies
between the two irmovq instructions and the addq instruction have not created
data hazards in this example.

We saw that prog1 will flow through our pipeline and get the correct results,
because the three nop instructions create a delay between instructions with data 

# prog2
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
0x014: nop
0x015: nop
0x016: addq %rdx,%rax
0x018: halt

Cycle 6, decode stage: valA ← R[%rdx] = 10, valB ← R[%rax] = 0

Figure 4.44 Pipelined execution of prog2 without special pipeline control. The
write to program register %rax does not occur until the start of cycle 7, and so the addq
instruction gets the incorrect value for this register in the decode stage.


dependencies. Let us see what happens as these nop instructions are removed.
Figure 4.44 illustrates the pipeline flow of a program, named prog2, containing
two nop instructions between the two irmovq instructions generating values for
registers %rdx and %rax and the addq instruction having these two registers as
operands. In this case, the crucial step occurs in cycle 6, when the addq instruc-
tion reads its operands from the register file. An expanded view of the pipeline
activities during this cycle is shown at the bottom of the figure. The first irmovq
instruction has passed through the write-back stage, and so program register %rdx
has been updated in the register file. The second irmovq instruction is in the
write-back stage during this cycle, and so the write to program register %rax only occurs
at the start of cycle 7 as the clock rises. As a result, the incorrect value zero would
be read for register %rax (recall that we assume all registers are initially zero),
since the pending write for this register has not yet occurred. Clearly, we will have
to adapt our pipeline to handle this hazard properly.

Figure 4.45 shows what happens when we have only one nop instruction 
between the irmovq instructions and the addq instruction, yielding a program 
prog3. Now we must examine the behavior of the pipeline during cycle 5 as the 
addq instruction passes through the decode stage. Unfortunately, the pending 

# prog3
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
0x014: nop
0x015: addq %rdx,%rax
0x017: halt

Cycle 5, memory stage: M_valE = 3, M_dstE = %rax

Figure 4.45 Pipelined execution of prog3 without special pipeline control. In cycle
5, the addq instruction reads its source operands from the register file. The pending
write to register %rdx is still in the write-back stage, and the pending write to register
%rax is still in the memory stage. Both operands valA and valB get incorrect values.


write to register %rdx is still in the write-back stage, and the pending write to
%rax is still in the memory stage. Therefore, the addq instruction would get the
incorrect values for both operands.

Figure 4.46 shows what happens when we remove all of the nop instructions 
between the irmovq instructions and the addq instruction, yielding a program 
prog4. Now we must examine the behavior of the pipeline during cycle 4 as the 
addq instruction passes through the decode stage. Unfortunately, the pending
write to register %rdx is still in the memory stage, and the new value for %rax
is just being computed in the execute stage. Therefore, the addq instruction would
get the incorrect values for both operands.

These examples illustrate that a data hazard can arise for an instruction 
when one of its operands is updated by any of the three preceding instructions. 
These hazards occur because our pipelined processor reads the operands for an
instruction from the register file in the decode stage but does not write the results 
for the instruction to the register file until three cycles later, after the instruction 
passes through the write-back stage. 
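
The outcomes for prog1 through prog4 follow from simple cycle arithmetic. The sketch below assumes an instruction fetched in cycle c is decoded in cycle c + 1 and is in write-back in cycle c + 4, with the register file updated at the start of the following cycle; the function name stale_operands is hypothetical.

```python
def stale_operands(num_nops):
    """Registers for which addq reads an out-of-date value when
    `num_nops` nop instructions separate it from the second irmovq."""
    # The two irmovq instructions are fetched in cycles 1 and 2.
    writeback_cycle = {"%rdx": 1 + 4, "%rax": 2 + 4}
    # addq is instruction number 3 + num_nops, decoded one cycle after fetch.
    decode_cycle = (3 + num_nops) + 1
    # A value is visible only in cycles after its write-back cycle.
    return sorted(reg for reg, wb in writeback_cycle.items() if decode_cycle <= wb)

for name, nops in [("prog1", 3), ("prog2", 2), ("prog3", 1), ("prog4", 0)]:
    print(name, stale_operands(nops) or "no hazard")
```

With three nop instructions no operand is stale, with two only %rax is, and with one or none both are, matching Figures 4.43 through 4.46.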

# prog4
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
0x014: addq %rdx,%rax
0x016: halt

Cycle 4, memory stage: M_valE = 10
Cycle 4, decode stage: valA ← R[%rdx] = 0, valB ← R[%rax] = 0

Figure 4.46 Pipelined execution of prog4 without special pipeline control. In cycle
4, the addq instruction reads its source operands from the register file. The pending
write to register %rdx is still in the memory stage, and the new value for register %rax
is just being computed in the execute stage. Both operands valA and valB get incorrect
values.








Avoiding Data Hazards by Stalling 






One very general technique for avoiding hazards involves stalling, where the 
processor holds back one or more instructions in the pipeline until the hazard 
condition no longer holds. Our processor can avoid data hazards by holding back 
an instruction in the decode stage until the instructions generating its source op- 
erands have passed through the write-back stage. The details of this mechanism 
will be discussed in Section 4.5.8. It involves simple enhancements to the pipeline 
control logic. The effect of stalling is diagrammed in Figure 4.47 (prog2) and Fig- 
ure 4.48 (prog4). (We omit prog3 from this discussion, since it operates similarly 
to the other two examples.) When the addq instruction is in the decode stage,
the pipeline control logic detects that at least one of the instructions in the exe-
cute, memory, or write-back stage will update either register %rdx or register %rax.
Rather than letting the addq instruction pass through the stage with the incorrect
results, it stalls the instruction, holding it back in the decode stage for either one
(for prog2) or three (for prog4) extra cycles. For all three programs, the addq
instruction finally gets correct values for its two source operands in cycle 7 and then
proceeds down the pipeline.

# prog2
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
0x014: nop
0x015: nop
       bubble
0x016: addq %rdx,%rax
0x018: halt

Figure 4.47 Pipelined execution of prog2 using stalls. After decoding the addq
instruction in cycle 6, the stall control logic detects a data hazard due to the pending
write to register %rax in the write-back stage. It injects a bubble into the execute stage
and repeats the decoding of the addq instruction in cycle 7. In effect, the machine has
dynamically inserted a nop instruction, giving a flow similar to that shown for prog1
(Figure 4.43).


# prog4
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
       bubble
       bubble
       bubble
0x014: addq %rdx,%rax
0x016: halt

Figure 4.48 Pipelined execution of prog4 using stalls. After decoding the addq
instruction in cycle 4, the stall control logic detects data hazards for both source registers.
It injects a bubble into the execute stage and repeats the decoding of the addq instruction
on cycle 5. It again detects hazards for both source registers, injects a bubble into the
execute stage, and repeats the decoding of the addq instruction on cycle 6. Still, it
detects a hazard for source register %rax, injects a bubble into the execute stage, and
repeats the decoding of the addq instruction on cycle 7. In effect, the machine has
dynamically inserted three nop instructions, giving a flow similar to that shown for
prog1 (Figure 4.43).


In holding back the addq instruction in the decode stage, we must also hold 
back the halt instruction following it in the fetch stage. We can do this by keeping 
the program counter at a fixed value, so that the halt instruction will be fetched 
repeatedly until the stall has completed. 
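
Stalling can be tried out with a small cycle-by-cycle model. This is a sketch, not the control logic of Section 4.5.8: each instruction is reduced to a hypothetical (name, source registers, destination register) tuple, the decode stage stalls whenever a source matches the destination of an instruction in the execute, memory, or write-back stage, and each stall holds the fetch stage and places an empty slot (None) in the execute stage.

```python
def make_prog(num_nops):
    """The prog1..prog4 family: num_nops nops between the irmovqs and addq."""
    prog = [("irmovq", (), "%rdx"), ("irmovq", (), "%rax")]
    prog += [("nop", (), None)] * num_nops
    prog += [("addq", ("%rdx", "%rax"), "%rax"), ("halt", (), None)]
    return prog

def simulate(program):
    """Return (number of stall cycles, last cycle in which addq is in decode)."""
    pc = 0
    D = E = M = W = None           # instruction held in each stage, None if empty
    cycle = 0
    bubbles = 0
    decode_log = []                # (cycle, name of instruction in decode)
    while pc < len(program) or any(s is not None for s in (D, E, M, W)):
        cycle += 1
        F = program[pc] if pc < len(program) else None
        if D is not None:
            decode_log.append((cycle, D[0]))
        # Destination registers of instructions that have not yet written back.
        pending = {s[2] for s in (E, M, W) if s is not None and s[2] is not None}
        stall = D is not None and any(src in pending for src in D[1])
        # Clock edge: the later stages always advance.
        W, M = M, E
        if stall:
            E = None               # empty slot in execute; D and the fetch stay put
            bubbles += 1
        else:
            E, D = D, F
            if F is not None:
                pc += 1
    last_decode = max(c for c, name in decode_log if name == "addq")
    return bubbles, last_decode

for nops in (3, 2, 1, 0):
    print(nops, "nops:", simulate(make_prog(nops)))
```

For every version of the program the addq instruction leaves the decode stage in cycle 7, and the number of stall cycles equals the number of nop instructions that were removed.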

Stalling involves holding back one group of instructions in their stages while
allowing other instructions to continue flowing through the pipeline. What then 
should we do in the stages that would normally be processing the addq instruction? 
We handle these by injecting a bubble into the execute stage each time we hold 
an instruction back in the decode stage. A bubble is like a dynamically generated
nop instruction—it does not cause any changes to the registers, the memory, the 
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' Aside Enumerating classes of data hazards 


Hazards can potentially occur when one instruction updates part of the program state that will be
read by a later instruction. For Y86-64, the program state includes the program registers, the program
counter, the memory, the condition code register, and the status register. Let us look at the hazard
possibilities in our proposed design for each of these forms of state.

Program registers. These are the hazards we have already identified. They arise because the register
file is read in one stage and written in another, leading to possible unintended interactions
between different instructions.


Program counter. Conflicts between updating and reading the program counter give rise to control
hazards. No hazard arises when our fetch-stage logic correctly predicts the new value of
the program counter before fetching the next instruction. Mispredicted branches and ret
instructions require special handling, as will be discussed in Section 4.5.5.

Memory. Writes and reads of the data memory both occur in the memory stage. By the time an
instruction reading memory reaches this stage, any preceding instructions writing memory
will have already done so. On the other hand, there can be interference between instructions
writing data in the memory stage and the reading of instructions in the fetch stage, since the
instruction and data memories reference a single address space. This can only happen with
programs containing self-modifying code, where instructions write to a portion of memory
from which instructions are later fetched. Some systems have complex mechanisms to detect
and avoid such hazards, while others simply mandate that programs should not use
self-modifying code. We will assume for simplicity that programs do not modify themselves, and
therefore we do not need to take special measures to update the instruction memory based
on updates to the data memory during program execution.

Condition code register. These are written by integer operations in the execute stage. They are read by
conditional moves in the execute stage and by conditional jumps in the memory stage. By the
time a conditional move or jump reaches the execute stage, any preceding integer operation
will have already completed this stage. No hazards can arise.




Status register. The program status can be affected by instructions as they flow through the pipeline.
Our mechanism of associating a status code with each instruction in the pipeline enables
the processor to come to an orderly halt when an exception occurs, as will be discussed in
Section 4.5.6.


This analysis shows that we only need to deal with register data hazards, control hazards, and
making sure exceptions are handled properly. A systematic analysis of this form is important when
designing a complex system. It can identify the potential difficulties in implementing the system, and it
can guide the generation of test programs to be used in checking the correctness of the system.


condition codes, or the program status. These are shown as white boxes in the
pipeline diagrams of Figures 4.47 and 4.48. In these figures the arrow between
the box labeled “D” for the addq instruction and the box labeled “E” for one of
the pipeline bubbles indicates that a bubble was injected into the execute stage in
place of the addq instruction that would normally have passed from the decode to
the execute stage. We will look at the detailed mechanisms for making the pipeline
stall and for injecting bubbles in Section 4.5.8.

In using stalling to handle data hazards, we effectively execute programs
prog2 and prog4 by dynamically generating the pipeline flow seen for prog1
(Figure 4.43). Injecting one bubble for prog2 and three for prog4 has the same effect
as having three nop instructions between the second irmovq instruction and the
addq instruction. This mechanism can be implemented fairly easily (see Problem
4.53), but the resulting performance is not very good. There are numerous cases
in which one instruction updates a register and a closely following instruction uses
the same register. This will cause the pipeline to stall for up to three cycles,
reducing the overall throughput significantly.
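
The bubble counts quoted for prog1 through prog4 follow one rule: an instruction can leave the decode stage only once every pending write to its source registers has completed. The Python sketch below illustrates that rule under simplifying assumptions. The function name stall_cycles, the constant REQUIRED_GAP, and the "gap" representation are ours, not part of the Y86-64 design.

```python
# Minimal sketch (not taken from the Y86-64 sources): how many bubbles a
# purely stalling pipeline injects before an instruction can leave decode.

# Registers are read in decode and written in write-back, so a consumer needs
# at least three instructions between itself and the producer of a source
# register in order to read the updated value without stalling.
REQUIRED_GAP = 3

def stall_cycles(gaps):
    """gaps: for each pending producer of a source register, the number of
    instructions between that producer and the consumer (hypothetical input)."""
    return max((max(0, REQUIRED_GAP - g) for g in gaps), default=0)

# Gaps from the two irmovq producers (%rdx, then %rax) to the addq consumer.
programs = {"prog1": [4, 3], "prog2": [3, 2], "prog3": [2, 1], "prog4": [1, 0]}
for name, gaps in programs.items():
    print(name, stall_cycles(gaps))
```

The closest producer determines the stall length, which is why prog4, with no instructions between the second irmovq and the addq, needs the full three bubbles.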


Avoiding Data Hazards by Forwarding 

Our design for PIPE- reads source operands from the register file in the decode
stage, but there can also be a pending write to one of these source registers in 
the write-back stage. Rather than stalling until the write has completed, it can 
simply pass the value that is about to be written to pipeline register E as the 
source operand. Figure 4.49 shows this strategy with an expanded view of the 
pipeline diagram for cycle 6 of prog2. The decode-stage logic detects that register 


%rax is the source register for operand valB, and that there is also a pending
write to %rax on write port E. It can therefore avoid stalling by simply using the
data word supplied to port E (signal W_valE) as the value for operand valB. This
technique of passing a result value directly from one pipeline stage to an earlier
one is commonly known as data forwarding (or simply forwarding, and sometimes
bypassing). It allows the instructions of prog2 to proceed through the pipeline
without any stalling. Data forwarding requires adding additional data connections
and control logic to the basic hardware structure.

# prog2                    1 2 3 4 5 6 7 8
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
0x014: nop
0x015: nop
0x016: addq %rdx,%rax
0x018: halt

Cycle 6:
srcA = %rdx    valA ← R[%rdx] = 10
srcB = %rax    valB ← W_valE = 3

Figure 4.49 Pipelined execution of prog2 using forwarding. In cycle 6, the decode-stage
logic detects the presence of a pending write to register %rax in the write-back
stage. It uses this value for source operand valB rather than the value read from the
register file.

# prog3                    1 2 3 4 5 6 7 8 9
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
0x014: nop
0x015: addq %rdx,%rax
0x017: halt

Cycle 5:
W_dstE = %rdx
W_valE = 10
srcA = %rdx    valA ← W_valE = 10
srcB = %rax    valB ← M_valE = 3

Figure 4.50 Pipelined execution of prog3 using forwarding. In cycle 5, the decode-stage
logic detects a pending write to register %rdx in the write-back stage and to
register %rax in the memory stage. It uses these as the values for valA and valB rather
than the values read from the register file.

As Figure 4.50 illustrates, data forwarding can also be used when there is
a pending write to a register in the memory stage, avoiding the need to stall
for program prog3. In cycle 5, the decode-stage logic detects a pending write to
register %rdx on port E in the write-back stage, as well as a pending write to register
%rax that is on its way to port E but is still in the memory stage. Rather than stalling
until the writes have occurred, it can use the value in the write-back stage (signal
W_valE) for operand valA and the value in the memory stage (signal M_valE) for
operand valB.

# prog4
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rax
0x014: addq %rdx,%rax
0x016: halt

Cycle 4:
M_dstE = %rdx
M_valE = 10
E_dstE = %rax
e_valE ← 0 + 3 = 3
srcA = %rdx    valA ← M_valE = 10
srcB = %rax    valB ← e_valE = 3

Figure 4.51 Pipelined execution of prog4 using forwarding. In cycle 4, the decode-stage
logic detects a pending write to register %rdx in the memory stage. It also detects
that a new value is being computed for register %rax in the execute stage. It uses these
as the values for valA and valB rather than the values read from the register file.


To exploit data forwarding to its full extent, we can also pass newly computed
values from the execute stage to the decode stage, avoiding the need to stall for
program prog4, as illustrated in Figure 4.51. In cycle 4, the decode-stage logic
detects a pending write to register %rdx in the memory stage, and also that the
value being computed by the ALU in the execute stage will later be written to
register %rax. It can use the value in the memory stage (signal M_valE) for operand
valA. It can also use the ALU output (signal e_valE) for operand valB. Note that
using the ALU output does not introduce any timing problems. The decode stage
only needs to generate signals valA and valB by the end of the clock cycle so that
pipeline register E can be loaded with the results from the decode stage as the
clock rises to start the next cycle. The ALU output will be valid before this point.

The uses of forwarding illustrated in programs prog2 to prog4 all involve
the forwarding of values generated by the ALU and destined for write port E.
Forwarding can also be used with values read from the memory and destined for
write port M. From the memory stage, we can forward the value that has just been
read from the data memory (signal m_valM). From the write-back stage, we can
forward the pending write to port M (signal W_valM). This gives a total of five
different forwarding sources (e_valE, m_valM, M_valE, W_valM, and W_valE) and
two different forwarding destinations (valA and valB).

The expanded diagrams of Figures 4.49 to 4.51 also show how the decode-stage
logic can determine whether to use a value from the register file or to use
a forwarded value. Associated with every value that will be written back to the
register file is the destination register ID. The logic can compare these IDs with
the source register IDs srcA and srcB to detect a case for forwarding. It is possible
to have multiple destination register IDs match one of the source IDs. We must
establish a priority among the different forwarding sources to handle such cases.
This will be discussed when we look at the detailed design of the forwarding logic.

Figure 4.52 shows the structure of PIPE, an extension of PIPE- that can
handle data hazards by forwarding. Comparing this to the structure of PIPE-
(Figure 4.41), we can see that the values from the five forwarding sources are fed
back to the two blocks labeled “Sel+Fwd A” and “Fwd B” in the decode stage.
The block labeled “Sel+Fwd A” combines the role of the block labeled “Select A”
in PIPE- with the forwarding logic. It allows valA for pipeline register E to be
either the incremented program counter valP, the value read from the A port
of the register file, or one of the forwarded values. The block labeled “Fwd B”
implements the forwarding logic for source operand valB.
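
The comparison of destination register IDs against srcA and srcB can be sketched in a few lines of Python. This is only an illustration of the idea, not the HCL for the "Fwd B" block. The function select_operand, the use of strings for register IDs, and None for "no destination" are assumptions of the sketch. So is the priority order, which lists sources from the earliest pipeline stage first on the reasoning that the execute stage holds the most recent instruction.

```python
# Sketch of decode-stage forwarding for one source operand. Register IDs are
# plain strings and None stands for "no destination" (an assumption of this
# sketch; the hardware uses a reserved register ID instead).

def select_operand(src, regfile_value, sources):
    """Return the forwarded value if a source's destination ID matches src.

    sources: list of (dst_id, value) pairs in priority order. Putting earlier
    pipeline stages first is an assumed ordering: they hold the most recent
    instruction, and therefore the newest value for a register.
    """
    for dst, value in sources:
        if src is not None and dst == src:
            return value
    return regfile_value  # no pending write: use the register file

# Cycle 4 of prog4: irmovq $10,%rdx is in memory, irmovq $3,%rax in execute.
sources = [
    ("%rax", 3),   # e_valE (ALU output, destination E_dstE)
    (None, 0),     # m_valM (no memory read pending)
    ("%rdx", 10),  # M_valE
    (None, 0),     # W_valM
    (None, 0),     # W_valE
]
val_a = select_operand("%rdx", 0, sources)  # srcA of addq
val_b = select_operand("%rax", 0, sources)  # srcB of addq
print(val_a, val_b)
```

When two sources name the same register, the first match in the list wins, which is how a priority among forwarding sources resolves the multiple-match case.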


Load/Use Data Hazards 


One class of data hazards cannot be handled purely by forwarding, because mem- 
ory reads occur late in the pipeline. Figure 4.53 illustrates an example of a load/use 
hazard, where one instruction (the mrmovq at address 0x028) reads a value from 
memory for register %rax while the next instruction (the addq at address 0x032)
needs this value as a source operand. Expanded views of cycles 7 and 8 are shown 
in the lower part of the figure, where we assume all program registers initially have 
value 0. The addq instruction requires the value of the register in cycle 7, but it is 
not generated by the mrmovq instruction until cycle 8. In order to “forward” from 
the mrmovq to the addq, the forwarding logic would have to make the value go 
backward in time! Since this is clearly impossible, we must find some other mechanism
for handling this form of data hazard. (The data hazard for register %rbx,
with the value being generated by the irmovq instruction at address 0x01e and 
used by the addq instruction at address 0x032, can be handled by forwarding.) 

As Figure 4.54 demonstrates, we can avoid a load/use data hazard with a
combination of stalling and forwarding. This requires modifications of the control
logic, but it can use existing bypass paths. As the mrmovq instruction passes
through the execute stage, the pipeline control logic detects that the instruction
in the decode stage (the addq) requires the result read from memory. It stalls the
instruction in the decode stage for one cycle, causing a bubble to be injected into
the execute stage. As the expanded view of cycle 8 shows, the value read from
memory can then be forwarded from the memory stage to the addq instruction
in the decode stage. The value for register %rbx is also forwarded from the
write-back stage to the decode stage. As indicated in the pipeline diagram by the arrow from
the box labeled “D” in cycle 7 to the box labeled “E” in cycle 8, the injected bubble
replaces the addq instruction that would normally continue flowing through
the pipeline.








Figure 4.52 Hardware structure of PIPE, our final pipelined implementation. The
additional bypassing paths enable forwarding the results from the three preceding
instructions. This allows us to handle most forms of data hazards without stalling the
pipeline.

# prog5                          1 2 3 4 5 6 7 8 9 10 11
0x000: irmovq $128,%rdx
0x00a: irmovq $3,%rcx
0x014: rmmovq %rcx, 0(%rdx)
0x01e: irmovq $10,%rbx
0x028: mrmovq 0(%rdx),%rax   # Load %rax
0x032: addq %rbx,%rax        # Use %rax
0x034: halt

Cycle 7:
M_dstE = %rbx
M_valE = 10
valA ← M_valE = 10
valB ← R[%rax] = 0    (Error)

Cycle 8:
M_dstM = %rax
m_valM ← M[128] = 3

Figure 4.53 Example of load/use data hazard. The addq instruction requires the value
of register %rax during the decode stage in cycle 7. The preceding mrmovq reads a new
value for this register during the memory stage in cycle 8, which is too late for the addq
instruction.


This use of a stall to handle a load/use hazard is called a load interlock. Load 
interlocks combined with forwarding suffice to handle all possible forms of data 
hazards. Since only load interlocks reduce the pipeline throughput, we can nearly 
achieve our throughput goal of issuing one new instruction on every clock cycle. 
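
The condition that triggers a load interlock can be written as a small predicate. The sketch below is a Python illustration, not the pipeline control logic itself (which is deferred to Section 4.5.8). The signal names follow the PIPE naming convention, but the function load_use_hazard and the string encoding of instruction codes are our own assumptions. We also assume that mrmovq and popq are the only instructions that load a register from memory.

```python
# Sketch of a load/use hazard check. Instruction codes are modelled as
# strings; LOAD_ICODES is an assumed set of instructions that read data
# memory into a register.
LOAD_ICODES = {"IMRMOVQ", "IPOPQ"}

def load_use_hazard(E_icode, E_dstM, d_srcA, d_srcB):
    """True when the instruction in execute will load a register that the
    instruction in decode needs as a source operand."""
    return E_icode in LOAD_ICODES and E_dstM in (d_srcA, d_srcB)

# Cycle 7 of prog5: mrmovq 0(%rdx),%rax is in execute, addq %rbx,%rax in decode.
needs_interlock = load_use_hazard("IMRMOVQ", "%rax", "%rbx", "%rax")
# An irmovq in execute never needs an interlock: forwarding is enough.
no_interlock = load_use_hazard("IIRMOVQ", None, "%rbx", "%rax")
print(needs_interlock, no_interlock)
```

When the predicate holds, the decode stage is held for one cycle and a bubble enters the execute stage, after which the loaded value can be forwarded from the memory stage.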


Avoiding Control Hazards 


Control hazards arise when the processor cannot reliably determine the address 
of the next instruction based on the current instruction in the fetch stage. As 
was discussed in Section 4.5.4, control hazards can only occur in our pipelined 
processor for ret and jump instructions. Moreover, the latter case only causes dif- 
ficulties when the direction of a conditional jump is mispredicted. In this section, 
we provide a high-level view of how these hazards can be handled. The detailed 
implementation will be presented in Section 4.5.8 as part of a more general dis- 
cussion of the pipeline control. 

# prog5
0x000: irmovq $128,%rdx
0x00a: irmovq $3,%rcx
0x014: rmmovq %rcx, 0(%rdx)
0x01e: irmovq $10,%rbx
0x028: mrmovq 0(%rdx),%rax   # Load %rax
       bubble
0x032: addq %rbx,%rax        # Use %rax
0x034: halt

Cycle 8:
W_dstE = %rbx
W_valE = 10
M_dstM = %rax
m_valM ← M[128] = 3
valA ← W_valE = 10
valB ← m_valM = 3

Figure 4.54 Handling a load/use hazard by stalling. By stalling the addq instruction for one cycle in the
decode stage, the value for valB can be forwarded from the mrmovq instruction in the memory stage to the
addq instruction in the decode stage.


For the ret instruction, consider the following example program. This program
is shown in assembly code, but with the addresses of the different instructions
on the left for reference:

0x000: irmovq stack,%rsp     # Initialize stack pointer
0x00a: call proc             # Procedure call
0x013: irmovq $10,%rdx       # Return point
0x01d: halt
0x020: .pos 0x20
0x020: proc:                 # proc:
0x020: ret                   # Return immediately
0x021: rrmovq %rdx,%rbx      # Not executed
0x030: .pos 0x30
0x030: stack:                # stack: Stack pointer


Figure 4.55 shows how we want the pipeline to process the ret instruction.
As with our earlier pipeline diagrams, this figure shows the pipeline activity with
time growing to the right. Unlike before, the instructions are not listed in the
same order they occur in the program, since this program involves a control flow
where instructions are not executed in a linear sequence. It is useful to look at the
instruction addresses to identify the different instructions in the program.

# prog?                      1 2 3 4 5 6 7 8 9 10 11
0x000: irmovq stack,%rsp
0x00a: call proc
0x020: ret
       bubble
       bubble
       bubble
0x013: irmovq $10,%rdx   # Return point

Figure 4.55 Simplified view of ret instruction processing. The pipeline should stall while the ret passes
through the decode, execute, and memory stages, injecting three bubbles in the process. The PC selection
logic will choose the return address as the instruction fetch address once the ret reaches the write-back stage
(cycle 7).

As this diagram shows, the ret instruction is fetched during cycle 3 and 
proceeds down the pipeline, reaching the write-back stage in cycle 7. While it 
passes through the decode, execute, and memory stages, the pipeline cannot do 
any useful activity. Instead, we want to inject three bubbles into the pipeline. Once 
the ret instruction reaches the write-back stage, the PC selection logic will set the 
program counter to the return address, and therefore the fetch stage will fetch the 
irmovq instruction at the return point (address 0x013).

To handle a mispredicted branch, consider the following program, shown in 
assembly code but with the instruction addresses shown on the left for reference: 


0x000: xorq %rax,%rax
0x002: jne target            # Not taken
0x00b: irmovq $1,%rax        # Fall through
0x015: halt
0x016: target:
0x016: irmovq $2,%rdx        # Target
0x020: irmovq $3,%rbx        # Target+1
0x02a: halt


Figure 4.56 shows how these instructions are processed. As before, the instructions
are listed in the order they enter the pipeline, rather than the order they occur
in the program. Since the jump instruction is predicted as being taken, the instruction
at the jump target will be fetched in cycle 3, and the instruction following this
one will be fetched in cycle 4. By the time the branch logic detects that the jump
should not be taken during cycle 4, two instructions have been fetched that should
not continue being executed. Fortunately, neither of these instructions has caused
a change in the programmer-visible state. That can only occur when an instruction
reaches the execute stage, where it can cause the condition codes to change. At
this point, the pipeline can simply cancel (sometimes called instruction squashing)
the two misfetched instructions by injecting bubbles into the decode and execute
stages on the following cycle while also fetching the instruction following the jump
instruction. The two misfetched instructions will then simply disappear from the
pipeline and therefore not have any effect on the programmer-visible state. The
only drawback is that two clock cycles' worth of instruction processing capability
have been wasted.

# prog?                      1 2 3 4 5 6 7 8 9 10
0x000: xorq %rax,%rax
0x002: jne target            # Not taken
0x016: irmovq $2,%rdx        # Target
       bubble
0x020: irmovq $3,%rbx        # Target+1
       bubble
0x00b: irmovq $1,%rax        # Fall through
0x015: halt

Figure 4.56 Processing mispredicted branch instructions. The pipeline predicts
branches will be taken and so starts fetching instructions at the jump target. Two
instructions are fetched before the misprediction is detected in cycle 4 when the jump
instruction flows through the execute stage. In cycle 5, the pipeline cancels the two
target instructions by injecting bubbles into the decode and execute stages, and it also
fetches the instruction following the jump.

This discussion of control hazards indicates that they can be handled by
careful consideration of the pipeline control logic. Techniques such as stalling
and injecting bubbles into the pipeline dynamically adjust the pipeline flow when
special conditions arise. As we will discuss in Section 4.5.8, a simple extension to
the basic clocked register design will enable us to stall stages and to inject bubbles
into pipeline registers as part of the pipeline control logic.
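
To get a feel for what these special cases cost, we can tally the bubbles they inject: one for a load/use hazard, two for a mispredicted branch, and three for a ret. The Python sketch below does this for a hypothetical instruction mix. The workload numbers and the helper names are invented for illustration, and the cycles-per-instruction estimate ignores the time to fill and drain the pipeline.

```python
# Bubbles injected per event, as described in the text for this pipeline.
BUBBLES = {"load_use": 1, "mispredicted_branch": 2, "ret": 3}

def wasted_cycles(event_counts):
    """Total bubbles injected for the given number of each kind of event."""
    return sum(BUBBLES[kind] * count for kind, count in event_counts.items())

def cycles_per_instruction(n_instructions, event_counts):
    """Average cycles per instruction, ignoring pipeline fill and drain."""
    return 1.0 + wasted_cycles(event_counts) / n_instructions

# Hypothetical workload: 100 instructions with a few hazard events.
events = {"load_use": 5, "mispredicted_branch": 4, "ret": 2}
print(wasted_cycles(events), cycles_per_instruction(100, events))
```

Under these assumptions, even a modest number of ret instructions and mispredicted branches pulls the pipeline measurably away from the ideal of one instruction per cycle.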


4.5.6 Exception Handling 


As we will discuss in Chapter 8, a variety of activities in a processor can lead
to exceptional control flow, where the normal chain of program execution gets
broken. Exceptions can be generated either internally, by the executing program,
or externally, by some outside signal. Our instruction set architecture includes
three different internally generated exceptions, caused by (1) a halt instruction,
(2) an instruction with an invalid combination of instruction and function code,
and (3) an attempt to access an invalid address, either for instruction fetch or
data read or write. A more complete processor design would also handle external
exceptions, such as when the processor receives a signal that the network interface
has received a new packet or the user has clicked a mouse button. Handling
exceptions correctly is a challenging aspect of any microprocessor design. They can
occur at unpredictable times, and they require creating a clean break in the flow
of instructions through the processor pipeline. Our handling of the three internal
exceptions gives just a glimpse of the true complexity of correctly detecting and
handling exceptions.

Let us refer to the instruction causing the exception as the excepting instruction.
In the case of an invalid instruction address, there is no actual excepting
instruction, but it is useful to think of there being a sort of “virtual instruction”
at the invalid address. In our simplified ISA model, we want the processor to halt
when it reaches an exception and to set the appropriate status code, as listed in
Figure 4.5. It should appear that all instructions up to the excepting instruction have
completed, but none of the following instructions should have any effect on the
programmer-visible state. In a more complete design, the processor would continue
by invoking an exception handler, a procedure that is part of the operating
system, but implementing this part of exception handling is beyond the scope of
our presentation.

In a pipelined system, exception handling involves several subtleties. First, it is
possible to have exceptions triggered by multiple instructions simultaneously. For
example, during one cycle of pipeline operation, we could have a halt instruction
in the fetch stage, and the data memory could report an out-of-bounds data
address for the instruction in the memory stage. We must determine which of these
exceptions the processor should report to the operating system. The basic rule is to
put priority on the exception triggered by the instruction that is furthest along the
pipeline. In the example above, this would be the out-of-bounds address attempted
by the instruction in the memory stage. In terms of the machine-language program,
the instruction in the memory stage should appear to execute before one in the
fetch stage, and therefore only this exception should be reported to the operating
system.

A second subtlety occurs when an instruction is first fetched and begins
execution, causes an exception, and later is canceled due to a mispredicted branch. 
The following is an example of such a program in its object-code form: 


0x000: 6300                 | xorq %rax,%rax
0x002: 741600000000000000   | jne target          # Not taken
0x00b: 30f00100000000000000 | irmovq $1,%rax      # Fall through
0x015: 00                   | halt
0x016:                      | target:
0x016: ff                   | .byte 0xFF          # Invalid instruction code


In this program, the pipeline will predict that the branch should be taken,
and so it will fetch and attempt to use a byte with value 0xFF as an instruction
(generated in the assembly code using the .byte directive). The decode stage will
therefore detect an invalid instruction exception. Later, the pipeline will discover
that the branch should not be taken, and so the instruction at address 0x016
should never even have been fetched. The pipeline control logic will cancel this
instruction, but we want to avoid raising an exception.

A third subtlety arises because a pipelined processor updates different parts
of the system state in different stages. It is possible for an instruction following 
one causing an exception to alter some part of the state before the excepting 
instruction completes. For example, consider the following code sequence, in 
which we assume that user programs are not allowed to access addresses at the 
upper end of the 64-bit range: 


irmovq $1,%rax
xorq %rsp,%rsp    # Set stack pointer to 0 and CC to 100
pushq %rax        # Attempt to write to 0xfffffffffffffff8
addq %rax,%rax    # (Should not be executed) Would set CC to 000


The pushq instruction causes an address exception, because decrementing the
stack pointer causes it to wrap around to 0xfffffffffffffff8. This exception
is detected in the memory stage. On the same cycle, the addq instruction is in
the execute stage, and it will cause the condition codes to be set to new values.
This would violate our requirement that none of the instructions following the
excepting instruction should have had any effect on the system state.

In general, we can both correctly choose among the different exceptions and 
avoid raising exceptions for instructions that are fetched due to mispredicted 
branches by merging the exception-handling logic into the pipeline structure. That
is the motivation for us to include a status code stat in each of our pipeline registers 
(Figures 4.41 and 4.52). If an instruction generates an exception at some stage in 
its processing, the status field is set to indicate the nature of the exception. The 
exception status propagates through the pipeline with the rest of the information 
for that instruction, until it reaches the write-back stage. At this point, the pipeline 
control logic detects the occurrence of the exception and stops execution. 

To avoid having any updating of the programmer-visible state by instructions 
beyond the excepting instruction, the pipeline control logic must disable any 
updating of the condition code register or the data memory when an instruction in 
the memory or write-back stages has caused an exception. In the example program 
above, the control logic will detect that the pushq in the memory stage has caused 
an exception, and therefore the updating of the condition code register by the 
addq instruction in the execute stage will be disabled. 
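
The rule for protecting the condition codes can be expressed as a single Boolean condition. The Python sketch below is an illustration under assumptions, not the actual control logic: the status names SAOK, SADR, SINS, and SHLT and the function set_cc are our choices for this example, and signals are modelled as strings.

```python
# Assumed names for the status codes that indicate an exception.
EXCEPTION_STATS = {"SADR", "SINS", "SHLT"}

def set_cc(E_icode, m_stat, W_stat):
    """Allow a condition-code update only for an integer operation with no
    excepting instruction ahead of it in the memory or write-back stage."""
    return (E_icode == "IOPQ"
            and m_stat not in EXCEPTION_STATS
            and W_stat not in EXCEPTION_STATS)

# pushq in the memory stage reports an address exception while addq executes.
blocked = set_cc("IOPQ", "SADR", "SAOK")
# With no exception ahead of it, the same addq may update the condition codes.
allowed = set_cc("IOPQ", "SAOK", "SAOK")
print(blocked, allowed)
```

Writes to the data memory would need a similar guard, since the memory is the other piece of programmer-visible state that a later instruction could alter before the excepting instruction reaches write-back.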

Let us consider how this method of handling exceptions deals with the subtleties
we have mentioned. When an exception occurs in one or more stages of a
pipeline, the information is simply stored in the status fields of the pipeline
registers. The event has no effect on the flow of instructions in the pipeline until an
excepting instruction reaches the final pipeline stage, except to disable any updating
of the programmer-visible state (the condition code register and the memory)
by later instructions in the pipeline. Since instructions reach the write-back stage
in the same order as they would be executed in a nonpipelined processor, we are
guaranteed that the first instruction encountering an exception will arrive first in
the write-back stage, at which point program execution can stop and the status
code in pipeline register W can be recorded as the program status. If some
instruction is fetched but later canceled, any exception status information about the
instruction gets canceled as well. No instruction following one that causes an
exception can alter the programmer-visible state. The simple rule of carrying the
exception status together with all other information about an instruction through
the pipeline provides a simple and reliable mechanism for handling exceptions.
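
The effect of carrying the status with each instruction can be sketched by walking a list of instructions in the order they would reach the write-back stage. This Python fragment is a toy model with invented names (final_status, the tuple layout, and the status strings). It only illustrates that canceled instructions take their status with them and that the first surviving exception determines the program status.

```python
def final_status(instructions):
    """instructions: (name, stat, cancelled) tuples in write-back order."""
    for name, stat, cancelled in instructions:
        if cancelled:
            continue  # the status disappears along with the instruction
        if stat != "SAOK":
            return name, stat  # first excepting instruction stops execution
    return None, "SAOK"

# The mispredicted-branch example with an invalid byte at the jump target.
trace = [
    ("xorq", "SAOK", False),
    ("jne", "SAOK", False),
    (".byte 0xFF", "SINS", True),  # fetched on the wrong path, then canceled
    ("irmovq", "SAOK", False),
    ("halt", "SHLT", False),
]
print(final_status(trace))
```

In this model the invalid-instruction status never reaches write-back, so the program stops at the halt instruction instead.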


4.5.7 PIPE Stage Implementations 


We have now created an overall structure for PIPE, our pipelined Y86-64 processor
with forwarding. It uses the same set of hardware units as the earlier sequential
designs, with the addition of pipeline registers, some reconfigured logic blocks, and
additional pipeline control logic. In this section, we go through the design of the
different logic blocks, deferring the design of the pipeline control logic to the next
section. Many of the logic blocks are identical to their counterparts in SEQ and
SEQ+, except that we must choose proper versions of the different signals from
the pipeline registers (written with the pipeline register name, written in uppercase,
as a prefix) or from the stage computations (written with the first character
of the stage name, written in lowercase, as a prefix).

As an example, compare the HCL code for the logic that generates the srcA 
signal in SEQ to the corresponding code in PIPE: 

# Code from SEQ
word srcA = [
        icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ } : rA;
        icode in { IPOPQ, IRET } : RRSP;
        1 : RNONE; # Don't need register
];

# Code from PIPE
word d_srcA = [
        D_icode in { IRRMOVQ, IRMMOVQ, IOPQ, IPUSHQ } : D_rA;
        D_icode in { IPOPQ, IRET } : RRSP;
        1 : RNONE; # Don't need register
];


They differ only in the prefixes added to the PIPE signals: D_ for the source
values, to indicate that the signals come from pipeline register D, and d_ for the
result value, to indicate that it is generated in the decode stage. To avoid repetition,
we will not show the HCL code here for blocks that only differ from those in SEQ
because of the prefixes on names. As a reference, the complete HCL code for
PIPE is given in Web Aside ARCH:HCL on page 472.


PC Selection and Fetch Stage 


Figure 4.57 provides a detailed view of the PIPE fetch stage logic. As discussed 
earlier, this stage must also select a current value for the program counter and 
predict the next PC value. The hardware units for reading the instruction from
memory and for extracting the different instruction fields are the same as those
we considered for SEQ (see the fetch stage in Section 4.3.4).

Figure 4.57 PIPE PC selection and fetch logic. Within the one cycle time limit, the
processor can only predict the address of the next instruction.


The PC selection logic chooses between three program counter sources. As a
mispredicted branch enters the memory stage, the value of valP for this instruction
(indicating the address of the following instruction) is read from pipeline register
M (signal M_valA). When a ret instruction enters the write-back stage, the return
address is read from pipeline register W (signal W_valM). All other cases use the
predicted value of the PC, stored in pipeline register F (signal F_predPC):


word f_pc = [
        # Mispredicted branch. Fetch at incremented PC
        M_icode == IJXX && !M_Cnd : M_valA;
        # Completion of RET instruction
        W_icode == IRET : W_valM;
        # Default: Use predicted value of PC
        1 : F_predPC;
];

The PC prediction logic chooses valC for the fetched instruction when it is
either a call or a jump, and valP otherwise:


word f_predPC = [
    f_icode in { IJXX, ICALL } : f_valC;
    1 : f_valP;
];
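As a rough illustration, the PC selection and PC prediction logic can be mirrored in Python, with each HCL case becoming one test in priority order. This is only a sketch: the function names select_pc and predict_pc are hypothetical, and the instruction-code constants assume the Y86-64 encodings (7 for jXX, 8 for call, 9 for ret, 1 for nop).

```python
# Instruction codes, assuming the Y86-64 encodings.
INOP, IJXX, ICALL, IRET = 0x1, 0x7, 0x8, 0x9


def select_pc(M_icode, M_Cnd, M_valA, W_icode, W_valM, F_predPC):
    """Hypothetical mirror of the f_pc case expression (first match wins)."""
    if M_icode == IJXX and not M_Cnd:
        return M_valA        # mispredicted branch: fetch at incremented PC
    if W_icode == IRET:
        return W_valM        # completion of ret: use the return address
    return F_predPC          # default: use predicted value of PC


def predict_pc(f_icode, f_valC, f_valP):
    """Hypothetical mirror of f_predPC: jumps and calls predict valC."""
    return f_valC if f_icode in (IJXX, ICALL) else f_valP


# A ret at address 0x020 is predicted to fall through to 0x021.
print(hex(predict_pc(IRET, 0, 0x021)))
# Once the ret reaches write-back, the return address overrides the prediction.
print(hex(select_pc(INOP, False, 0, IRET, 0x013, 0x021)))
```

The order of the tests in select_pc matters in the same way as the order of cases in the HCL: a mispredicted branch in the memory stage is older than anything behind it, so it is checked first.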


The logic blocks labeled “Instr valid,” “Need regids,” and “Need valC” are 
the same as for SEQ, with appropriately named source signals. 

Unlike in SEQ, we must split the computation of the instruction status into 
two parts. In the fetch stage, we can test for a memory error due to an out-of-range 
instruction address, and we can detect an illegal instruction or a halt instruction. 
Detecting an invalid data address must be deferred to the memory stage. 





Write HCL code for the signal f_stat, providing the provisional status for the
fetched instruction.


Decode and Write-Back Stages 


Figure 4.58 gives a detailed view of the decode and write-back logic for PIPE. The
blocks labeled dstE, dstM, srcA, and srcB are very similar to their counterparts
in the implementation of SEQ. Observe that the register IDs supplied to the
write ports come from the write-back stage (signals W_dstE and W_dstM), rather
than from the decode stage. This is because we want the writes to occur to the
destination registers specified by the instruction in the write-back stage.





The block labeled “dstE” in the decode stage generates the register ID for the E
port of the register file, based on fields from the fetched instruction in pipeline
register D. The resulting signal is named d_dstE in the HCL description of PIPE.
Write HCL code for this signal, based on the HCL description of the SEQ signal
dstE. (See the decode stage for SEQ in Section 4.3.4.) Do not concern yourself
with the logic to implement conditional moves yet.


Most of the complexity of this stage is associated with the forwarding logic. 
As mentioned earlier, the block labeled “Sel+Fwd A” serves two roles. It merges 
the valP signal into the valA signal for later stages in order to reduce the amount 
of state in the pipeline register. It also implements the forwarding logic for source 
operand valA. 

The merging of signals valA and valP exploits the fact that only the call and
jump instructions need the value of valP in later stages, and these instructions
do not need the value read from the A port of the register file. This selection is
controlled by the icode signal for this stage. When signal D_icode matches the
instruction code for either call or jXX, this block should select D_valP as its
output.

Figure 4.58 PIPE decode and write-back stage logic. No instruction requires both valP and the value read
from register port A, and so these two can be merged to form the signal valA for later stages. The block labeled
“Sel+Fwd A” performs this task and also implements the forwarding logic for source operand valA. The block
labeled “Fwd B” implements the forwarding logic for source operand valB. The register write locations are
specified by the dstE and dstM signals from the write-back stage rather than from the decode stage, since it
is writing the results of the instruction currently in the write-back stage.

As mentioned in Section 4.5.5, there are five different forwarding sources,
each with a data word and a destination register ID:

Data word    Register ID    Source description
e_valE       e_dstE         ALU output
m_valM       M_dstM         Memory output
M_valE       M_dstE         Pending write to port E in memory stage
W_valM       W_dstM         Pending write to port M in write-back stage
W_valE       W_dstE         Pending write to port E in write-back stage

If none of the forwarding conditions hold, the block should select d_rvalA, the
value read from register port A, as its output.

Putting all of this together, we get the following HCL description for the new 
value of valA for pipeline register E: 


word d_valA = [
    D_icode in { ICALL, IJXX } : D_valP; # Use incremented PC
    d_srcA == e_dstE : e_valE; # Forward valE from execute
    d_srcA == M_dstM : m_valM; # Forward valM from memory
    d_srcA == M_dstE : M_valE; # Forward valE from memory
    d_srcA == W_dstM : W_valM; # Forward valM from write back
    d_srcA == W_dstE : W_valE; # Forward valE from write back
    1 : d_rvalA; # Use value read from register file
];

The priority given to the five forwarding sources in the above HCL code is
very important. This priority is determined in the HCL code by the order in which
the five destination register IDs are tested. If any order other than the one shown
were chosen, the pipeline would behave incorrectly for some programs. Figure 4.59
shows an example of a program that requires a correct setting of priority among
the forwarding sources in the execute and memory stages. In this program, the
first two instructions write to register %rdx, while the third uses this register as its
source operand. When the rrmovq instruction reaches the decode stage in cycle
4, the forwarding logic must choose between two values destined for its source
register. Which one should it choose? To set the priority, we must consider the
behavior of the machine-language program when it is executed one instruction
at a time. The first irmovq instruction would set register %rdx to 10, the second
would set the register to 3, and then the rrmovq instruction would read 3 from
%rdx. To imitate this behavior, our pipelined implementation should always give
priority to the forwarding source in the earliest pipeline stage, since it holds the
latest instruction in the program sequence setting the register. Thus, the logic in the
HCL code above first tests the forwarding source in the execute stage, then those in
the memory stage, and finally the sources in the write-back stage. The forwarding
priority between the two sources in either the memory or the write-back stages
is only a concern for the instruction popq %rsp, since only this instruction can
attempt two simultaneous writes to the same register.
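The effect of the priority ordering can be sketched in Python. The code below is an illustrative model, not part of the PIPE design: select_valA is a hypothetical name, the constants assume the Y86-64 encodings, and the pipeline contents are set by hand to those of the three-instruction program just described (irmovq $10,%rdx, then irmovq $3,%rdx, then rrmovq %rdx,%rax) when the rrmovq instruction is in the decode stage.

```python
# Register IDs and instruction codes, assuming the Y86-64 encodings.
RDX, RNONE = 0x2, 0xF
IRRMOVQ, IJXX, ICALL = 0x2, 0x7, 0x8


def select_valA(D_icode, D_valP, d_srcA, d_rvalA, sources):
    """Hypothetical mirror of the d_valA case expression.

    sources lists (destination register, data word) pairs in priority
    order: execute, memory (port M, port E), write-back (port M, port E).
    """
    if D_icode in (ICALL, IJXX):
        return D_valP              # use incremented PC
    for dst, value in sources:
        if d_srcA == dst:
            return value           # first matching source wins
    return d_rvalA                 # use value read from register file


# Cycle 4: rrmovq %rdx,%rax is in decode with srcA = %rdx.
sources = [
    (RDX, 3),      # e_dstE, e_valE: irmovq $3,%rdx in execute
    (RNONE, 0),    # M_dstM, m_valM
    (RDX, 10),     # M_dstE, M_valE: irmovq $10,%rdx in memory
    (RNONE, 0),    # W_dstM, W_valM
    (RNONE, 0),    # W_dstE, W_valE
]
correct = select_valA(IRRMOVQ, 0x016, RDX, 0, sources)
wrong = select_valA(IRRMOVQ, 0x016, RDX, 0, list(reversed(sources)))
print(correct, wrong)   # prints "3 10"
```

Testing the sources in the reverse order picks up the stale value 10 from the memory stage, which is exactly the kind of incorrect behavior the priority rule is meant to prevent.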
# prog8
0x000: irmovq $10,%rdx
0x00a: irmovq $3,%rdx
0x014: rrmovq %rdx,%rax
0x016: halt

[Pipeline diagram, cycle 4. Memory stage: M_dstE = %rdx, M_valE = 10. Execute stage: E_dstE = %rdx, e_valE ← 0 + 3 = 3. Decode stage: valA ← e_valE = 3.]

Figure 4.59 Demonstration of forwarding priority. In cycle 4, values for %rdx are
available from both the execute and memory stages. The forwarding logic should choose
the one in the execute stage, since it represents the most recently generated value for
this register.


Suppose the order of the third and fourth cases (the two forwarding sources
from the memory stage) in the HCL code for d_valA were reversed. Describe the
resulting behavior of the rrmovq instruction (line 5) for the following program:

1   irmovq $5,%rdx
2   irmovq $0x100,%rsp
3   rmmovq %rdx,0(%rsp)
4   popq %rsp
5   rrmovq %rsp,%rax





Suppose the order of the fifth and sixth cases (the two forwarding sources from
the write-back stage) in the HCL code for d_valA were reversed. Write a Y86-64
program that would be executed incorrectly. Describe how the error would occur
and its effect on the program behavior.


Write HCL code for the signal d_valB, giving the value for source operand valB
supplied to pipeline register E.








One small part of the write-back stage remains: As shown in Figure 4.52, the
overall processor status Stat is computed by a block based on the status value in
pipeline register W. Recall from Section 4.1.1 that the code should indicate either
normal operation (AOK) or one of the three exception conditions. Since pipeline
register W holds the state of the most recently completed instruction, it is natural
to use this value as an indication of the overall processor status. The only special
case to consider is when there is a bubble in the write-back stage. This is part of
normal operation, and so we want the status code to be AOK for this case as well:


word Stat = [
    W_stat == SBUB : SAOK;
    1 : W_stat;
];


Execute Stage 


Figure 4.60 shows the execute stage logic for PIPE. The hardware units and the
logic blocks are identical to those in SEQ, with an appropriate renaming of signals.
We can see the signals e_valE and e_dstE directed toward the decode stage as
one of the forwarding sources. One difference is that the logic labeled “Set CC,”
which determines whether or not to update the condition codes, has signals m_stat
and W_stat as inputs. These signals are used to detect cases where an instruction
causing an exception is passing through later pipeline stages, and therefore any
updating of the condition codes should be suppressed. This aspect of the design is
discussed in Section 4.5.8.

Figure 4.60 PIPE execute stage logic. This part of the design is very similar to the logic
in the SEQ implementation.

Figure 4.61 PIPE memory stage logic. Many of the signals from pipeline registers
M and W are passed down to earlier stages to provide write-back results, instruction
addresses, and forwarded results.


Practice Problem 4.35
Our second case in the HCL code for d_valA uses signal e_dstE to see whether
to select the ALU output e_valE as the forwarding source. Suppose instead that
we use signal E_dstE, the destination register ID in pipeline register E, for this
selection. Write a Y86-64 program that would give an incorrect result with this
modified forwarding logic.


Memory Stage 


Figure 4.61 shows the memory stage logic for PIPE. Comparing this to the memory
stage for SEQ (Figure 4.30), we see that, as noted before, the block labeled “Mem.
data” in SEQ is not present in PIPE. This block served to select between data
sources valP (for call instructions) and valA, but this selection is now performed
by the block labeled “Sel+Fwd A” in the decode stage. Most other blocks in this
stage are identical to their counterparts in SEQ, with an appropriate renaming
of the signals. In this figure, you can also see that many of the values in pipeline
registers M and W are supplied to other parts of the circuit as part of the
forwarding and pipeline control logic.


In this stage, we can complete the computation of the status code Stat by detecting
the case of an invalid address for the data memory. Write HCL code for the signal
m_stat.








4.5.8 Pipeline Control Logic 


We are now ready to complete our design for PIPE by creating the pipeline control
logic. This logic must handle the following four control cases for which other
mechanisms, such as data forwarding and branch prediction, do not suffice:


Load/use hazards. The pipeline must stall for one cycle between an instruction 
that reads a value from memory and an instruction that uses this value. 


Processing ret. The pipeline must stall until the ret instruction reaches the 
write-back stage. 


Mispredicted branches. By the time the branch logic detects that a jump should 
not have been taken, several instructions at the branch target will have 
started down the pipeline. These instructions must be canceled, and fetch- 
ing should begin at the instruction following the jump instruction. 


Exceptions. When an instruction causes an exception, we want to disable the 
updating of the programmer-visible state by later instructions and halt 
execution once the excepting instruction reaches the write-back stage. 


We will go through the desired actions for each of these cases and then develop 
control logic to handle all of them. 


Desired Handling of Special Control Cases 


For a load/use hazard, we have described the desired pipeline operation in Section
4.5.5, as illustrated by the example of Figure 4.54. Only the mrmovq and popq
instructions read data from memory. When (1) either of these is in the execute
stage and (2) an instruction requiring the destination register is in the decode
stage, we want to hold back the second instruction in the decode stage and inject a
bubble into the execute stage on the next cycle. After this, the forwarding logic will
resolve the data hazard. The pipeline can hold back an instruction in the decode
stage by keeping pipeline register D in a fixed state. In doing so, it should also
keep pipeline register F in a fixed state, so that the next instruction will be fetched
a second time. In summary, implementing this pipeline flow requires detecting the
hazard condition, keeping pipeline registers F and D fixed, and injecting a bubble
into the execute stage.

For the processing of a ret instruction, we have described the desired pipeline
operation in Section 4.5.5. The pipeline should stall for three cycles until the
return address is read as the ret instruction passes through the memory stage.
This was illustrated by a simplified pipeline diagram in Figure 4.55 for processing
the following program:


0x000:    irmovq stack,%rsp   # Initialize stack pointer
0x00a:    call proc           # Procedure call
0x013:    irmovq $10,%rdx     # Return point
0x01d:    halt
0x020: .pos 0x20
0x020: proc:                  # proc:
0x020:    ret                 # Return immediately
0x021:    rrmovq %rdx,%rbx    # Not executed
0x030: .pos 0x30
0x030: stack:                 # stack: Stack pointer


Figure 4.62 provides a detailed view of the processing of the ret instruction
for the example program. The key observation here is that there is no way to
inject a bubble into the fetch stage of our pipeline. On every cycle, the fetch stage
reads some instruction from the instruction memory. Looking at the HCL code
for implementing the PC prediction logic in Section 4.5.7, we can see that for the
ret instruction, the new value of the PC is predicted to be valP, the address of the
following instruction. In our example program, this would be 0x021, the address
of the rrmovq instruction following the ret. This prediction is not correct for this
example, nor would it be for most cases, but we are not attempting to predict return
addresses correctly in our design. For three clock cycles, the fetch stage stalls,
causing the rrmovq instruction to be fetched but then replaced by a bubble in the
decode stage. This process is illustrated in Figure 4.62 by the three fetches, with an
arrow leading down to the bubbles passing through the remaining pipeline stages.
Finally, the irmovq instruction is fetched on cycle 7. Comparing Figure 4.62 with
Figure 4.55, we see that our implementation achieves the desired effect, but with
a slightly peculiar fetching of an incorrect instruction for three consecutive cycles.

# prog6
0x000: irmovq Stack,%rsp
0x00a: call proc
0x020: ret
0x021: rrmovq %rdx,%rbx   # Not executed
       bubble
0x021: rrmovq %rdx,%rbx   # Not executed
       bubble
0x021: rrmovq %rdx,%rbx   # Not executed
       bubble
0x013: irmovq $10,%rdx    # Return point

Figure 4.62 Detailed processing of the ret instruction. The fetch stage repeatedly
fetches the rrmovq instruction following the ret instruction, but then the pipeline
control logic injects a bubble into the decode stage rather than allowing the rrmovq
instruction to proceed. The resulting behavior is equivalent to that shown in Figure 4.55.

When a mispredicted branch occurs, we have described the desired pipeline 
operation in Section 4.5.5 and illustrated it in Figure 4.56. The misprediction will 
be detected as the jump instruction reaches the execute stage. The control logic 
then injects bubbles into the decode and execute stages on the next cycle, causing 
the two incorrectly fetched instructions to be canceled. On the same cycle, the 
pipeline reads the correct instruction into the fetch stage. 

For an instruction that causes an exception, we must make the pipelined im- 
plementation match the desired ISA behavior, with all prior instructions complet- 
ing and with none of the following instructions having any effect on the program 
state. Achieving these effects is complicated by the facts that (1) exceptions are 
detected during two different stages (fetch and memory) of program execution, 
and (2) the program state is updated in three different stages (execute, memory, 
and write-back). 

Our stage designs include a status code stat in each pipeline register to track 
the status of each instruction as it passes through the pipeline stages. When an 
exception occurs, we record that information as part of the instruction’s status and 
continue fetching, decoding, and executing instructions as if nothing were amiss. 
As the excepting instruction reaches the memory stage, we take steps to prevent 
later instructions from modifying the programmer-visible state by (1) disabling 
the setting of condition codes by instructions in the execute stage, (2) injecting 
bubbles into the memory stage to disable any writing to the data memory, and (3) 
stalling the write-back stage when it has an excepting instruction, thus bringing 
the pipeline to a halt. 

The pipeline diagram in Figure 4.63 illustrates how our pipeline control handles
the situation where an instruction causing an exception is followed by one that
would change the condition codes. On cycle 6, the pushq instruction reaches the
memory stage and generates a memory error. On the same cycle, the addq instruction
in the execute stage generates new values for the condition codes. We disable
the setting of condition codes when an excepting instruction is in the memory or
write-back stage (by examining the signals m_stat and W_stat and then setting the
signal set_cc to zero). We can also see the combination of injecting bubbles into the
memory stage and stalling the excepting instruction in the write-back stage in the
example of Figure 4.63: the pushq instruction remains stalled in the write-back
stage, and none of the subsequent instructions get past the execute stage.

By this combination of pipelining the status signals, controlling the setting of
condition codes, and controlling the pipeline stages, we achieve the desired behavior
for exceptions: all instructions prior to the excepting instruction are completed,
while none of the following instructions has any effect on the programmer-visible
state.


Detecting Special Control Conditions 


Figure 4.64 summarizes the conditions requiring special pipeline control. It gives 
expressions describing the conditions under which the three special cases arise. 





# prog10
0x000: irmovq $1,%rax
0x00a: xorq %rsp,%rsp    # CC = 100
0x00c: pushq %rax
0x00e: addq %rax,%rax
0x010: irmovq $2,%rax

[Pipeline diagram over cycles 1 through 10, annotated with "New CC = 000" for the addq instruction.]

Figure 4.63 Processing invalid memory reference exception. On cycle 6, the invalid
memory reference by the pushq instruction causes the updating of the condition codes
to be disabled. The pipeline starts injecting bubbles into the memory stage and stalling
the excepting instruction in the write-back stage.








Condition              Trigger
Processing ret         IRET ∈ {D_icode, E_icode, M_icode}
Load/use hazard        E_icode ∈ {IMRMOVQ, IPOPQ} && E_dstM ∈ {d_srcA, d_srcB}
Mispredicted branch    E_icode = IJXX && !e_Cnd
Exception              m_stat ∈ {SADR, SINS, SHLT} || W_stat ∈ {SADR, SINS, SHLT}


Figure 4.64 Detection conditions for pipeline control logic. Four different conditions 
require altering the pipeline flow by either stalling the pipeline or canceling partially 
executed instructions. 


These expressions are implemented by simple blocks of combinational logic that
must generate their results before the end of the clock cycle in order to control
the action of the pipeline registers as the clock rises to start the next cycle. During
a clock cycle, pipeline registers D, E, and M hold the states of the instructions
that are in the decode, execute, and memory pipeline stages, respectively. As
we approach the end of the clock cycle, signals d_srcA and d_srcB will be set to
the register IDs of the source operands for the instruction in the decode stage.
Detecting a ret instruction as it passes through the pipeline simply involves
checking the instruction codes of the instructions in the decode, execute, and
memory stages. Detecting a load/use hazard involves checking the instruction
type (mrmovq or popq) of the instruction in the execute stage and comparing its
destination register with the source registers of the instruction in the decode stage.
The pipeline control logic should detect a mispredicted branch while the jump
instruction is in the execute stage, so that it can set up the conditions required to
recover from the misprediction as the instruction enters the memory stage. When a
jump instruction is in the execute stage, the signal e_Cnd indicates whether or not
the jump should be taken. We detect an excepting instruction by examining the
instruction status values in the memory and write-back stages. For the memory
stage, we use the signal m_stat, computed within the stage, rather than M_stat
from the pipeline register. This internal signal incorporates the possibility of a
data memory address error.
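The four detection conditions of Figure 4.64 translate almost directly into Boolean functions. The following Python sketch is illustrative only: the function names are hypothetical, the instruction codes assume the Y86-64 encodings, and the numeric status-code values are placeholders chosen for the example.

```python
# Instruction codes, assuming the Y86-64 encodings.
IMRMOVQ, IOPQ, IJXX, IRET, IPUSHQ, IPOPQ = 0x5, 0x6, 0x7, 0x9, 0xA, 0xB
# Status codes; the numeric values are placeholders for illustration.
SAOK, SADR, SINS, SHLT = 1, 2, 3, 4


def processing_ret(D_icode, E_icode, M_icode):
    # A ret is somewhere in the decode, execute, or memory stage.
    return IRET in (D_icode, E_icode, M_icode)


def load_use_hazard(E_icode, E_dstM, d_srcA, d_srcB):
    # A load in execute targets a register needed by the instruction in decode.
    return E_icode in (IMRMOVQ, IPOPQ) and E_dstM in (d_srcA, d_srcB)


def mispredicted_branch(E_icode, e_Cnd):
    # A jump in execute turns out not to be taken.
    return E_icode == IJXX and not e_Cnd


def exception(m_stat, W_stat):
    # An excepting instruction is in the memory or write-back stage.
    bad = (SADR, SINS, SHLT)
    return m_stat in bad or W_stat in bad


# An invalid memory reference detected in the memory stage.
print(exception(SADR, SAOK))            # prints "True"
# The same cycle has no jump in execute, so no misprediction is flagged.
print(mispredicted_branch(IOPQ, True))  # prints "False"
```

Each function reads only signals that are stable near the end of the clock cycle, which matches the requirement that the results be ready before the clock rises.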


Pipeline Control Mechanisms 


Figure 4.65 shows low-level mechanisms that allow the pipeline control logic to
hold back an instruction in a pipeline register or to inject a bubble into the pipeline.
These mechanisms involve small extensions to the basic clocked register described
in Section 4.2.5.

[Diagram with three panels, (a) Normal, (b) Stall, and (c) Bubble, each showing the register state and output before and after the rising clock for the given settings of the stall and bubble inputs.]

Figure 4.65 Additional pipeline register operations. (a) Under normal conditions,
the state and output of the register are set to the value at the input when the clock rises.
(b) When operated in stall mode, the state is held fixed at its previous value. (c) When
operated in bubble mode, the state is overwritten with that of a nop operation.

                       Pipeline register
Condition              F        D        E        M        W
Processing ret         stall    bubble   normal   normal   normal
Load/use hazard        stall    stall    bubble   normal   normal
Mispredicted branch    normal   bubble   bubble   normal   normal

Figure 4.66 Actions for pipeline control logic. The different conditions require altering
the pipeline flow by either stalling the pipeline or canceling partially executed instructions.

Suppose that each pipeline register has two control inputs stall
and bubble. The settings of these signals determine how the pipeline register is
updated as the clock rises. Under normal operation (Figure 4.65(a)), both of these
inputs are set to 0, causing the register to load its input as its new state. When the
stall signal is set to 1 (Figure 4.65(b)), the updating of the state is disabled. Instead,
the register will remain in its previous state. This makes it possible to hold back
an instruction in some pipeline stage. When the bubble signal is set to 1 (Figure
4.65(c)), the state of the register will be set to some fixed reset configuration, giving
a state equivalent to that of a nop instruction. The particular pattern of ones and
zeros for a pipeline register's reset configuration depends on the set of fields in
the pipeline register. For example, to inject a bubble into pipeline register D, we
want the icode field to be set to the constant value INOP (Figure 4.26). To inject
a bubble into pipeline register E, we want the icode field to be set to INOP and
the dstE, dstM, srcA, and srcB fields to be set to the constant RNONE. Determining
the reset configuration is one of the tasks for the hardware designer in designing
a pipeline register. We will not concern ourselves with the details here. We will
consider it an error to set both the bubble and the stall signals to 1.
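A small software model can make the three register operations concrete. The sketch below is an assumption-laden illustration rather than a hardware description: the class name PipeReg and its clock method are hypothetical, and the reset configuration is represented by a single Python value standing in for the nop state.

```python
class PipeReg:
    """Toy model of a pipeline register with stall and bubble inputs."""

    def __init__(self, reset_state):
        self.reset_state = reset_state   # state equivalent to a nop
        self.state = reset_state

    def clock(self, new_input, stall=False, bubble=False):
        """Model one rising clock edge and return the resulting state."""
        if stall and bubble:
            # The text treats this combination as an error.
            raise ValueError("stall and bubble must not both be 1")
        if stall:
            pass                              # hold the previous state
        elif bubble:
            self.state = self.reset_state     # overwrite with nop state
        else:
            self.state = new_input            # normal: load the input
        return self.state


D = PipeReg(reset_state="nop")
print(D.clock("ret"))                    # normal load: prints "ret"
print(D.clock("rrmovq", stall=True))     # stalled: still prints "ret"
print(D.clock("rrmovq", bubble=True))    # bubble: prints "nop"
```

Raising an exception for the illegal combination is a choice made for this sketch; it mirrors the statement that setting both signals to 1 is considered an error.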

The table in Figure 4.66 shows the actions the different pipeline stages should
take for each of the three special conditions. Each involves some combination of
normal, stall, and bubble operations for the pipeline registers. In terms of timing,
the stall and bubble control signals for the pipeline registers are generated by
blocks of combinational logic. These values must be valid as the clock rises, causing
each of the pipeline registers to either load, stall, or bubble as the next clock cycle
begins. With this small extension to the pipeline register designs, we can implement
a complete pipeline, including all of its control, using the basic building blocks of
combinational logic, clocked registers, and random access memories.


Combinations of Control Conditions 


In our discussion of the special pipeline control conditions so far, we assumed that
at most one special case could arise during any single clock cycle. A common bug in
designing a system is to fail to handle instances where multiple special conditions
arise simultaneously. Let us analyze such possibilities. We need not worry about
combinations involving program exceptions, since we have carefully designed
our exception-handling mechanism to consider other instructions in the pipeline.
Figure 4.67 diagrams the pipeline states that cause the other three special control
conditions. These diagrams show blocks for the decode, execute, and memory
stages. The shaded boxes represent particular constraints that must be satisfied
for the condition to arise. A load/use hazard requires that the instruction in the
execute stage reads a value from memory into a register, and that the instruction
in the decode stage has this register as a source operand. A mispredicted branch
requires the instruction in the execute stage to be a jump instruction. There are
three possible cases for ret: the instruction can be in either the decode, execute,
or memory stage. As the ret instruction moves through the pipeline, the earlier
pipeline stages will have bubbles.

[Diagram with five pipeline states labeled Load/use, Mispredict, ret 1, ret 2, and ret 3, with arrows marking Combination A and Combination B.]

Figure 4.67 Pipeline states for special control conditions. The two pairs indicated can
arise simultaneously.

We can see by these diagrams that most of the control conditions are mutually
exclusive. For example, it is not possible to have a load/use hazard and a mispredicted
branch simultaneously, since one requires a load instruction (mrmovq or
popq) in the execute stage, while the other requires a jump. Similarly, the second
and third ret combinations cannot occur at the same time as a load/use hazard or
a mispredicted branch. Only the two combinations indicated by arrows can arise
simultaneously.

Combination A involves a not-taken jump instruction in the execute stage and
a ret instruction in the decode stage. Setting up this combination requires the ret
to be at the target of a not-taken branch. The pipeline control logic should detect
that the branch was mispredicted and therefore cancel the ret instruction.


Write a Y86-64 assembly-language program that causes combination A to arise
and determines whether the control logic handles it correctly.

Combining the control actions for the combination A conditions (Figure 4.66),
we get the following pipeline control actions (assuming that either a bubble or a
stall overrides the normal case):

                       Pipeline register
Condition              F        D        E        M        W
Processing ret         stall    bubble   normal   normal   normal
Mispredicted branch    normal   bubble   bubble   normal   normal
Combination            stall    bubble   bubble   normal   normal

That is, it would be handled like a mispredicted branch, but with a stall in the
fetch stage. Fortunately, on the next cycle, the PC selection logic will choose the
address of the instruction following the jump, rather than the predicted program
counter, and so it does not matter what happens with the pipeline register F. We
conclude that the pipeline will correctly handle this combination.

Combination B involves a load/use hazard, where the loading instruction sets
register %rsp and the ret instruction then uses this register as a source operand,
since it must pop the return address from the stack. The pipeline control logic
should hold back the ret instruction in the decode stage.





Write a Y86-64 assembly-language program that causes combination B to arise
and completes with a halt instruction if the pipeline operates correctly.


Combining the control actions for the combination B conditions (Figure 4.66),
we get the following pipeline control actions:

                       Pipeline register
Condition              F        D              E        M        W
Processing ret         stall    bubble         normal   normal   normal
Load/use hazard        stall    stall          bubble   normal   normal
Combination            stall    bubble+stall   bubble   normal   normal
Desired                stall    stall          bubble   normal   normal


If both sets of actions were triggered, the control logic would try to stall the ret
instruction to avoid the load/use hazard but also inject a bubble into the decode
stage due to the ret instruction. Clearly, we do not want the pipeline to perform
both sets of actions. Instead, we want it to just take the actions for the load/use
hazard. The actions for processing the ret instruction should be delayed for one
cycle.

This analysis shows that combination B requires special handling. In fact, our
original implementation of the PIPE control logic did not handle this combination
correctly. Even though the design had passed many simulation tests, it had a subtle
bug that was uncovered only by the analysis we have just shown. When a program
having combination B was executed, the control logic would set both the bubble
and the stall signals for pipeline register D to 1. This example shows the importance
of systematic analysis. It would be unlikely to uncover this bug by just running
normal programs. If left undetected, the pipeline would not faithfully implement
the ISA behavior.
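The kind of systematic analysis described here can be mimicked by merging the rows of Figure 4.66 mechanically. The Python sketch below is one hypothetical way to do so: the table is transcribed from Figure 4.66, while the combine function and the "conflict" marker are inventions of this example. It assumes, as the text does, that a stall or a bubble overrides the normal case.

```python
# Actions per pipeline register, transcribed from Figure 4.66.
ACTIONS = {
    "ret":        {"F": "stall",  "D": "bubble", "E": "normal",
                   "M": "normal", "W": "normal"},
    "load_use":   {"F": "stall",  "D": "stall",  "E": "bubble",
                   "M": "normal", "W": "normal"},
    "mispredict": {"F": "normal", "D": "bubble", "E": "bubble",
                   "M": "normal", "W": "normal"},
}


def combine(*conditions):
    """Merge the actions of simultaneous conditions, register by register."""
    merged = {}
    for reg in "FDEMW":
        requested = {ACTIONS[c][reg] for c in conditions} - {"normal"}
        if not requested:
            merged[reg] = "normal"
        elif len(requested) == 1:
            merged[reg] = requested.pop()
        else:
            merged[reg] = "conflict"     # both stall and bubble requested
    return merged


print(combine("ret", "mispredict"))   # combination A: no conflict
print(combine("ret", "load_use"))     # combination B: conflict at register D
```

For combination A the merge yields stall, bubble, bubble, normal, normal, matching the table in the text. For combination B it flags register D, where both a stall and a bubble are requested.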


Control Logic Implementation 


Figure 4.68 shows the overall structure of the pipeline control logic. Based on
signals from the pipeline registers and pipeline stages, the control logic generates
stall and bubble control signals for the pipeline registers and also determines
whether the condition code registers should be updated. We can combine the
detection conditions of Figure 4.64 with the actions of Figure 4.66 to create HCL
descriptions for the different pipeline control signals.

Figure 4.68 PIPE pipeline control logic. This logic overrides the normal flow of instructions through the
pipeline to handle special conditions such as procedure returns, mispredicted branches, load/use hazards,
and program exceptions.

Pipeline register F must be stalled for either a load/use hazard or a ret 
instruction: 


bool F_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVQ, IPOPQ } &&
    E_dstM in { d_srcA, d_srcB } ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode };





Pipeline register D must be set to bubble for a mispredicted branch or a ret
instruction. As the analysis in the preceding section shows, however, it should
not inject a bubble when there is a load/use hazard in combination with a ret
instruction:






bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Cnd) ||
    # Stalling at fetch while ret passes through pipeline
    # but not condition for a load/use hazard
    !(E_icode in { IMRMOVQ, IPOPQ } && E_dstM in { d_srcA, d_srcB }) &&
    IRET in { D_icode, E_icode, M_icode };
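
Because the simulators for this chapter are built by translating HCL into C, it may help to see roughly how the two signals above could look as C predicates. The following is a minimal sketch, not the generated code: the struct pipe_state, its field layout, and the helper names load_use_hazard and ret_in_flight are assumptions made for illustration, and the instruction codes are assumed to follow the Y86-64 encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Y86-64 instruction codes needed by these two signals (assumed encoding) */
enum { IMRMOVQ = 0x5, IJXX = 0x7, IRET = 0x9, IPOPQ = 0xB };

/* Hypothetical snapshot of the signals the control logic reads */
typedef struct {
    int D_icode, E_icode, M_icode; /* icode fields of pipeline registers D, E, M */
    int E_dstM;                    /* destination register of a pending load */
    int d_srcA, d_srcB;            /* source registers computed in decode */
    bool e_Cnd;                    /* branch condition computed in execute */
} pipe_state;

/* A load in execute whose destination is a source of the instruction in decode */
static bool load_use_hazard(const pipe_state *s) {
    return (s->E_icode == IMRMOVQ || s->E_icode == IPOPQ) &&
           (s->E_dstM == s->d_srcA || s->E_dstM == s->d_srcB);
}

/* A ret instruction is in the decode, execute, or memory stage */
static bool ret_in_flight(const pipe_state *s) {
    return s->D_icode == IRET || s->E_icode == IRET || s->M_icode == IRET;
}

/* Stall F for a load/use hazard or while a ret passes through the pipeline */
bool F_stall(const pipe_state *s) {
    assert(s != NULL);
    return load_use_hazard(s) || ret_in_flight(s);
}

/* Bubble D for a mispredicted branch, or for a ret unless a load/use hazard
   is also present (the combination analyzed in the preceding section) */
bool D_bubble(const pipe_state *s) {
    assert(s != NULL);
    bool mispredicted = (s->E_icode == IJXX && !s->e_Cnd);
    return mispredicted || (!load_use_hazard(s) && ret_in_flight(s));
}
```

With a ret in decode and a popq %rsp in execute, this sketch stalls F but does not bubble D, which is the behavior the HCL above is written to guarantee.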










Write HCL code for the signal E_bubble in the PIPE implementation.





Practice Problem 4.41 (solution page 493)
Write HCL code for the signal set_cc in the PIPE implementation. This should
only occur for OPq instructions, and should consider the effects of program exceptions.








Practice Problem 4.42
Write HCL code for the signals M_bubble and W_stall in the PIPE implementation.
The latter signal requires modifying the exception condition listed in Figure 4.64.

This covers all of the special pipeline control signal values. In the complete
HCL code for PIPE, all other pipeline control signals are set to zero.






4.5.9 Performance Analysis 






We can see that the conditions requiring special action by the pipeline control 
logic all cause our pipeline to fall short of the goal of issuing a new instruction on
every clock cycle. We can measure this inefficiency by determining how often a
bubble gets injected into the pipeline, since these cause unused pipeline cycles. A
return instruction generates three bubbles, a load/use hazard generates one, and 
a mispredicted branch generates two. We can quantify the effect these penalties 
have on the overall performance by computing an estimate of the average number 
of clock cycles PIPE would require per instruction it executes, a measure known 
as the CPI (for “cycles per instruction”). This measure is the reciprocal of the 
average throughput of the pipeline, but with time measured in clock cycles rather 
than picoseconds. It is a useful measure of the architectural efficiency of a design. 

If we ignore the performance implications of exceptions (which, by definition, 
will only occur rarely), another way to think about CPI is to imagine we run the

Aside Testing the design


As we have seen, there are many ways to introduce bugs into a design, even for a simple microprocessor.
With pipelining, there are many subtle interactions between the instructions at different pipeline stages. 
We have seen that many of the design challenges involve unusual instructions (such as popping to the 
stack pointer) or unusual instruction combinations (such as a not-taken jump followed by a ret). We 
also see that exception handling adds an entirely new dimension to the possible pipeline behaviors. 
How, then, can we be sure that our design is correct? For hardware manufacturers, this is a dominant 
concern, since they cannot simply report an error and have users download code patches over the 
Internet. Even a simple logic design error can have serious consequences, especially as microprocessors 
are increasingly used to operate systems that are critical to our lives and health, such as automotive 
antilock braking systems, heart pacemakers, and aircraft control systems. 

Simply simulating a design while running a number of “typical” programs is not a sufficient means 
of testing a system. Instead, thorough testing requires devising ways of systematically generating many 
tests that will exercise as many different instructions and instruction combinations as possible. In 
creating our Y86-64 processor designs, we also devised a number of testing scripts, each of which
generates many different tests, runs simulations of the processor, and compares the resulting register
and memory values to those produced by our yis instruction set simulator. Here is a brief description
of the scripts: 


optest. Runs 49 tests of different Y86-64 instructions with different source and destination registers 


jtest. Runs 64 tests of the different jump and call instructions, with different combinations of whether 
or not the branches are taken 


cmtest. Runs 28 tests of the different conditional move instructions, with different control combi- 
nations 


htest. Runs 600 tests of different data hazard possibilities, with different combinations of source
and destination instructions, and with different numbers of nop instructions between the 
instruction pairs 


ctest. Tests 22 different control combinations, based on an analysis similar to what we did in Sec- 
tion 4.5.8 


etest. Tests 12 different combinations where an instruction causes an exception and the instructions 
following it could alter the programmer-visible state 


The key idea of this testing method is that we want to be as systematic as possible, generating tests that
create the different conditions that are likely to cause pipeline errors.


processor on some benchmark program and observe the operation of the execute 
stage. On each cycle, the execute stage either (1) processes an instruction and this 
instruction continues through the remaining stages to completion, or (2) processes 
a bubble injected due to one of the three special cases. If the stage processes a total 
of $C_i$ instructions and $C_b$ bubbles, then the processor has required around $C_i + C_b$
total clock cycles to execute $C_i$ instructions. We say “around” because we ignore
Aside Formally verifying our design

Even when a design passes an extensive set of tests, we cannot be certain that it will operate correctly for
all possible programs. The number of possible programs we could test is unimaginably large, even if we
only consider tests consisting of short code segments. Newer methods of formal verification, however,
hold the promise that we can have tools that rigorously consider all possible behaviors of a system and
determine whether or not there are any design errors.

We were able to apply formal verification to an earlier version of our Y86-64 processors [13].
We set up a framework to compare the behavior of the pipelined design PIPE to the unpipelined
version SEQ. That is, it was able to prove that for an arbitrary machine-language program, the two
processors would have identical effects on the programmer-visible state. Of course, our verifier cannot
actually run all possible programs, since there are an infinite number of them. Instead, it uses a form
of proof by induction, showing a consistency between the two processors on a cycle-by-cycle basis.
Carrying out this analysis requires reasoning about the hardware using symbolic methods in which we
consider all program values to be arbitrary integers, and we abstract the ALU as a sort of “black box,”
computing some unspecified function over its arguments. We assume only that the ALUs for SEQ and
PIPE compute identical functions.

We used the HCL descriptions of the control logic to generate the control logic for our symbolic
processor models, and so we could catch any bugs in the HCL code. Being able to show that SEQ
and PIPE are identical does not guarantee that either of them faithfully implements the instruction set
architecture. However, it would uncover any bug due to an incorrect pipeline design, and this is the
major source of design errors.

In our experiments, we verified not only a version of PIPE similar to the one we have presented
in this chapter but also several variants that we give as homework problems, in which we add more
instructions, modify the hardware capabilities, or use different branch prediction strategies. Interestingly,
we found only one bug in all of our designs, involving control combination B (described in Section
4.5.8) for our solution to the variant described in Problem 4.58. This exposed a weakness in our testing
regime that caused us to add additional cases to the ctest testing script.

Formal verification is still in an early stage of development. The tools are often difficult to use, and
they do not have the capacity to verify large-scale designs. We were able to verify our processors in part
because of their relative simplicity. Even then, it required several weeks of effort and multiple runs of
the tools, each requiring up to 8 hours of computer time. This is an active area of research, with some
tools becoming commercially available and some in use at companies such as Intel, AMD, and IBM.




















the cycles required to start the instructions flowing through the pipeline. We can 
then compute the CPI for this benchmark as follows: 


$$\mathrm{CPI} = \frac{C_i + C_b}{C_i} = 1.0 + \frac{C_b}{C_i}$$

That is, the CPI equals 1.0 plus a penalty term $C_b/C_i$ indicating the average number
of bubbles injected per instruction executed. Since only three different instruction
types can cause a bubble to be injected, we can break this penalty term into three
components:

As we have mentioned, modern logic design involves writing textual representations of hardware
designs in a hardware description language. The design can then be tested by both simulation and a
variety of formal verification tools. Once we have confidence in the design, we can use logic synthesis
tools to translate the design into actual logic circuits.

We have developed models of our Y86-64 processor designs in the Verilog hardware description
language. These designs combine modules implementing the basic building blocks of the processor,
along with control logic generated directly from the HCL descriptions. We have been able to synthesize
some of these designs, download the logic circuit descriptions onto field-programmable gate array
(FPGA) hardware, and run the processors on actual programs.


$$\mathrm{CPI} = 1.0 + \mathit{lp} + \mathit{mp} + \mathit{rp}$$


where lp (for “load penalty”) is the average frequency with which bubbles are injected
while stalling for load/use hazards, mp (for “mispredicted branch penalty”)
is the average frequency with which bubbles are injected when canceling instructions
due to mispredicted branches, and rp (for “return penalty”) is the average
frequency with which bubbles are injected while stalling for ret instructions. Each
of these penalties indicates the total number of bubbles injected for the stated
reason (some portion of $C_b$) divided by the total number of instructions that were
executed ($C_i$).

To estimate each of these penalties, we need to know how frequently the 
relevant instructions (load, conditional branch, and return) occur, and for each of 
these how frequently the particular condition arises. Let us pick the following set 
of frequencies for our CPI computation (these are comparable to measurements 
reported in [44] and [46]): 


* Load instructions (mrmovq and popq) account for 25% of all instructions
executed. Of these, 20% cause load/use hazards.


* Conditional branches account for 20% of all instructions executed. Of these,
60% are taken and 40% are not taken.


* Return instructions account for 2% of all instructions executed.
We can therefore estimate each of our penalties as the product of the fre- 
quency of the instruction type, the frequency the condition arises, and the number 
of bubbles that get injected when the condition occurs: 


                      Instruction   Condition
Cause        Name     frequency     frequency    Bubbles   Product
Load/use     lp       0.25          0.20         1         0.05
Mispredict   mp       0.20          0.40         2         0.16
Return       rp       0.02          1.00         3         0.06
Total penalty                                              0.27

The sum of the three penalties is 0.27, giving a CPI of 1.27.
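
The arithmetic behind this estimate can be reproduced with a few lines of C. This is only a sketch: the struct penalty_term, the function estimate_cpi, and the array pipe_terms are names introduced here for illustration, and the frequencies are the ones assumed in the text rather than measurements.

```c
#include <assert.h>
#include <stddef.h>

/* One cause of bubbles (illustrative type, not part of the simulators) */
typedef struct {
    double instr_freq;  /* fraction of executed instructions of this type */
    double cond_freq;   /* fraction of those that trigger the condition */
    int bubbles;        /* bubbles injected each time the condition occurs */
} penalty_term;

/* CPI = 1.0 + sum over all causes of instr_freq * cond_freq * bubbles */
double estimate_cpi(const penalty_term *terms, size_t n) {
    assert(terms != NULL || n == 0);
    double cpi = 1.0;
    for (size_t i = 0; i < n; i++)
        cpi += terms[i].instr_freq * terms[i].cond_freq * terms[i].bubbles;
    return cpi;
}

/* The three causes, with the frequencies chosen in the text */
const penalty_term pipe_terms[] = {
    { 0.25, 0.20, 1 },  /* lp: load/use hazard */
    { 0.20, 0.40, 2 },  /* mp: mispredicted branch */
    { 0.02, 1.00, 3 },  /* rp: return */
};
```

Calling estimate_cpi(pipe_terms, 3) yields approximately 1.27. Changing a single field, such as the misprediction rate, shows how sensitive the CPI is to each cause.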

Our goal was to design a pipeline that can issue one instruction per cycle, 
giving a CPI of 1.0. We did not quite meet this goal, but the overall performance 
is still quite good. We can also see that any effort to reduce the CPI further should 
focus on mispredicted branches. They account for 0.16 of our total penalty of 0.27,
because conditional branches are common, our prediction strategy often fails, and 
we cancel two instructions for every misprediction. 


Suppose we use a branch prediction strategy that achieves a success rate of 65%,
such as backward taken, forward not taken (BTFNT), as described in Section 
4.5.4. What would be the impact on CPI, assuming all of the other frequencies are 
not affected? 


Let us analyze the relative performance of using conditional data transfers versus
conditional control transfers for the programs you wrote for Problems 4.5 and 4.6.
Assume that we are using these programs to compute the sum of the absolute
values of a very long array, and so the overall performance is determined largely by
the number of cycles required by the inner loop. Assume that our jump instructions
are predicted as being taken, and that around 50% of the array values are positive.


A. On average, how many instructions are executed in the inner loops of the 
two programs? 

B. On average, how many bubbles would be injected into the inner loops of the
two programs? 


C. What is the average number of clock cycles required per array element for 
the two programs? 


4.5.10 Unfinished Business 


We have created a structure for the PIPE pipelined microprocessor, designed the 
control logic blocks, and implemented pipeline control logic to handle special 
cases where normal pipeline flow does not suffice. Still, PIPE lacks several key 
features that would be required in an actual microprocessor design. We highlight 
a few of these and discuss what would be required to add them. 


Multicycle Instructions 


All of the instructions in the Y86-64 instruction set involve simple operations such
as adding numbers. These can be processed in a single clock cycle within the execute
stage. In a more complete instruction set, we would also need to implement
instructions requiring more complex operations such as integer multiplication and
division and floating-point operations. In a medium-performance processor such
as PIPE, typical execution times for these operations range from 3 or 4 cycles for 
floating-point addition up to 64 cycles for integer division. To implement these 
instructions, we require both additional hardware to perform the computations 
and a mechanism to coordinate the processing of these instructions with the rest 
of the pipeline. 

One simple approach to implementing multicycle instructions is to simply 
expand the capabilities of the execute stage logic with integer and floating-point 
arithmetic units. An instruction remains in the execute stage for as many clock 
cycles as it requires, causing the fetch and decode stages to stall. This approach is 
simple to implement, but the resulting performance is not very good. 
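
A rough way to see why this approach performs poorly is to count the stall cycles it adds. The sketch below assumes, as a simplification not stated in the text, that an operation with a latency of n cycles holds the execute stage for n cycles and therefore adds n - 1 stall cycles each time it executes. The function name and the example frequency are hypothetical.

```c
#include <assert.h>

/*
 * Estimated CPI when a multicycle operation simply occupies the execute stage.
 *   base_cpi: CPI of the pipeline without such operations
 *   freq:     fraction of executed instructions that are multicycle
 *   latency:  cycles each of them spends in the execute stage
 */
double cpi_with_stalling_execute(double base_cpi, double freq, int latency) {
    assert(freq >= 0.0 && freq <= 1.0 && latency >= 1);
    /* Each multicycle instruction stalls fetch and decode for latency - 1 cycles */
    return base_cpi + freq * (latency - 1);
}
```

For instance, if 1% of the executed instructions were integer divisions taking 64 cycles, cpi_with_stalling_execute(1.27, 0.01, 64) would give about 1.90, so under these assumptions a rare operation would account for most of the total penalty.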

Better performance can be achieved by handling the more complex opera- 
tions with special hardware functional units that operate independently of the 
main pipeline. Typically, there is one functional unit for performing integer mul- 
tiplication and division, and another for performing floating-point operations. As 
an instruction enters the decode stage, it can be issued to the special unit. While the 
unit performs the operation, the pipeline continues processing other instructions. 
Typically, the floating-point unit is itself pipelined, and thus multiple operations 
can execute concurrently in the main pipeline and in the different units. 

The operations of the different units must be synchronized to avoid incorrect 
behavior. For example, if there are data dependencies between the different 
operations being handled by different units, the control logic may need to stall 
one part of the system until the results from an operation handled by some other 
part of the system have been completed. Often, different forms of forwarding are 
used to convey results from one part of the system to other parts, just as we saw 
between the different stages of PIPE. The overall design becomes more complex 
than we have seen with PIPE, but the same techniques of stalling, forwarding, and 
pipeline control can be used to make the overall behavior match the sequential 
ISA model. 


Interfacing with the Memory System 


In our presentation of PIPE, we assumed that both the instruction fetch unit 
and the data memory could read or write any memory location in one clock 
cycle. We also ignored the possible hazards caused by self-modifying code where 
one instruction writes to the region of memory from which later instructions are 
fetched. Furthermore, we reference memory locations according to their virtual 
addresses, and these require a translation into physical addresses before the actual 
read or write operation can be performed. Clearly, it is unrealistic to do all of this 
processing in a single clock cycle. Even worse, the memory values being accessed 
may reside on disk, requiring millions of clock cycles to read into the processor 
memory. 

As will be discussed in Chapters 6 and 9, the memory system of a processor 
uses a combination of multiple hardware memories and operating system soft- 
ware to manage the virtual memory system. The memory system is organized as a 
hierarchy, with faster but smaller memories holding a subset of the memory being
backed up by slower and larger memories. At the level closest to the processor,
the cache memories provide fast access to the most heavily referenced memory
locations. A typical processor has two first-level caches: one for reading instructions
and one for reading and writing data. Another type of cache memory, known
as a translation look-aside buffer, or TLB, provides a fast translation from virtual
to physical addresses. Using a combination of TLBs and caches, it is indeed possible
to read instructions and read or write data in a single clock cycle most of
the time. Thus, our simplified view of memory referencing by our processors is
actually quite reasonable.

Although the caches hold the most heavily referenced memory locations, 
there will be times when a cache miss occurs, where some reference is made to 
a location that is not held in the cache. In the best case, the missing data can be 
retrieved from a higher-level cache or from the main memory of the processor, 
requiring 3 to 20 clock cycles. Meanwhile, the pipeline simply stalls, holding the 
instruction in the fetch or memory stage until the cache can perform the read 
or write operation. In terms of our pipeline design, this can be implemented by 
adding more stall conditions to the pipeline control logic. A cache miss and the 
consequent synchronization with the pipeline is handled completely by hardware, 
keeping the time required down to a small number of clock cycles. 

In some cases, the memory location being referenced is actually stored in 
the disk or nonvolatile memory. When this occurs, the hardware signals a page 
fault exception. Like other exceptions, this will cause the processor to invoke the 
operating system's exception handler code. This code will then set up a transfer 
from the disk to the main memory. Once this completes, the operating system will 
return to the original program, where the instruction causing the page fault will be 
re-executed. This time, the memory reference will succeed, although it might cause 
a cache miss. Having the hardware invoke an operating system routine, which then 
returns control back to the hardware, allows the hardware and system software 
to cooperate in the handling of page faults. Since accessing a disk can require 
millions of clock cycles, the several thousand cycles of processing performed by 
the OS page fault handler has little impact on performance. 

From the perspective of the processor, the combination of stalling to handle
short-duration cache misses and exception handling to handle long-duration
page faults takes care of any unpredictability in memory access times due to the
structure of the memory hierarchy.


4.6 Summary


We have seen that the instruction set architecture, or ISA, provides a layer of
abstraction between the behavior of a processor (in terms of the set of instructions
and their encodings) and how the processor is implemented. The ISA provides
a very sequential view of program execution, with one instruction executed to
completion before the next one begins.

Aside State-of-the-art microprocessor design


A five-stage pipeline, such as we have shown with the PIPE processor, represented the state of the art in
processor design in the mid-1980s. The prototype RISC processor developed by Patterson's research
group at Berkeley formed the basis for the first SPARC processor, developed by Sun Microsystems
in 1987. The processor developed by Hennessy's research group at Stanford was commercialized by
MIPS Technologies (a company founded by Hennessy) in 1986. Both of these used five-stage pipelines.
The Intel i486 processor also uses a five-stage pipeline, although with a different partitioning of
responsibilities among the stages, with two decode stages and a combined execute/memory stage [27].

These pipelined designs are limited to a throughput of at most one instruction per clock cycle. The
CPI (for “cycles per instruction”) measure described in Section 4.5.9 can never be less than 1.0. The
different stages can only process one instruction at a time. More recent processors support superscalar
operation, meaning that they can achieve a CPI less than 1.0 by fetching, decoding, and executing
multiple instructions in parallel. As superscalar processors have become widespread, the accepted
performance measure has shifted from CPI to its reciprocal, the average number of instructions
executed per cycle, or IPC. It can exceed 1.0 for superscalar processors. The most advanced designs
use a technique known as out-of-order execution to execute multiple instructions in parallel, possibly
in a totally different order than they occur in the program, while preserving the overall behavior implied
by the sequential ISA model. This form of execution is described in Chapter 5 as part of our discussion
of program optimization.

Pipelined processors are not just historical artifacts, however. The majority of processors sold are
used in embedded systems, controlling automotive functions, consumer products, and other devices
where the processor is not directly visible to the system's user. In these applications, the simplicity of
a pipelined processor, such as the one we have explored in this chapter, reduces its cost and power
requirements compared to higher-performance models.

More recently, as multicore processors have gained a following, some have argued that we could
get more overall computing power by integrating many simple processors on a single chip rather
than a smaller number of more complex ones. This strategy is sometimes referred to as “many-core”
processors [10].


We defined the Y86-64 instruction set by starting with the x86-64 instructions 
and simplifying the data types, address modes, and instruction encoding consider- 
ably. The resulting ISA has attributes of both RISC and CISC instruction sets. We 
then organized the processing required for the different instructions into a series 
of five stages, where the operations at each stage vary according to the instruction 
being executed. From this, we constructed the SEQ processor, in which an entire 
instruction is executed every clock cycle by having it flow through all five stages. 

Pipelining improves the throughput performance of a system by letting the 
different stages operate concurrently. At any given time, multiple operations are 
being processed by the different stages. In introducing this concurrency, we must 
be careful to provide the same program-level behavior as would a sequential 
execution of the program. We introduced pipelining by reordering parts of SEQ 
to get SEQ+ and then adding pipeline registers to create the PIPE- pipeline.

Web Aside ARCH:HCL HCL descriptions of Y86-64 processors


In this chapter, we have looked at portions of the HCL code for several simple logic designs and for
the control logic for Y86-64 processors SEQ and PIPE. For reference, we provide documentation of
the HCL language and complete HCL descriptions for the control logic of the two processors. Each of
these descriptions requires only five to seven pages of HCL code, and it is worthwhile to study them in
their entirety.

We enhanced the pipeline performance by adding forwarding logic to speed the
sending of a result from one instruction to another. Several special cases require 
additional pipeline control logic to stall or cancel some of the pipeline stages. 

Our design included rudimentary mechanisms to handle exceptions, where 
we make sure that only instructions up to the excepting instruction affect the 
programmer-visible state. Implementing a complete handling of exceptions would 
be significantly more challenging. Properly handling exceptions gets even more 
complex in systems that employ greater degrees of pipelining and parallelism. 

In this chapter, we have learned several important lessons about processor
design: 


* Managing complexity is a top priority. We want to make optimum use of the
hardware resources to get maximum performance at minimum cost. We did 
this by creating a very simple and uniform framework for processing all of the 
different instruction types. With this framework, we could share the hardware 
units among the logic for processing the different instruction types. 


* We do not need to implement the ISA directly. A direct implementation of the
ISA would imply a very sequential design. To achieve higher performance, 
we want to exploit the ability in hardware to perform many operations si- 
multaneously. This led to the use of a pipelined design. By careful design and 
analysis, we can handle the various pipeline hazards, so that the overall effect 
of running a program exactly matches what would be obtained with the ISA 
model. 


* Hardware designers must be meticulous. Once a chip has been fabricated,
it is nearly impossible to correct any errors. It is very important to get the 
design right on the first try. This means carefully analyzing different instruction 
types and combinations, even ones that do not seem to make sense, such 
as popping to the stack pointer. Designs must be thoroughly tested with 
systematic simulation test programs. In developing the control logic for PIPE, 
our design had a subtle bug that was uncovered only after a careful and 
systematic analysis of control combinations. 


4.6.1 Y86-64 Simulators 


The lab materials for this chapter include simulators for the SEQ and PIPE 
processors. Each simulator has two versions: 





* The GUI (graphic user interface) version displays the memory, program code,
and processor state in graphic windows. This provides a way to readily see how
the instructions flow through the processors. The control panel also allows you
to reset, single-step, or run the simulator interactively.


* The text version runs the same simulator, but it only displays information by
printing to the terminal. This version is not as useful for debugging, but it
allows automated testing of the processor.


The control logic for the simulators is generated by translating the HCL
declarations of the logic blocks into C code. This code is then compiled and linked
with the rest of the simulation code. This combination makes it possible for you
to test out variants of the original designs using the simulators. Testing scripts are
also available that thoroughly exercise the different instructions and the different
hazard possibilities.


Bibliographic Notes


For those interested in learning more about logic design, the Katz and Borriello
logic design textbook [58] is a standard introductory text, emphasizing the use of
hardware description languages. Hennessy and Patterson's computer architecture
textbook [46] provides extensive coverage of processor design, including both
simple pipelines, such as the one we have presented here, and advanced processors
that execute more instructions in parallel. Shriver and Smith [101] give a very
thorough presentation of an Intel-compatible x86-64 processor manufactured
by AMD.


Homework Problems


4.45

In Section 3.4.2, the x86-64 pushq instruction was described as decrementing the
stack pointer and then storing the register at the stack pointer location. So, if we 
had an instruction of the form pushq REG, for some register REG, it would be 
equivalent to the code sequence 


subq $8,%rsp Decrement stack pointer 
movq REG, (%rsp) Store REG on stack 


A. In light of analysis done in Practice Problem 4.7, does this code sequence 
correctly describe the behavior of the instruction pushq %rsp? Explain. 


B. How could you rewrite the code sequence so that it correctly describes both 
the cases where REG is %rsp as well as any other register? 


4.46 ◆

In Section 3.4.2, the x86-64 popq instruction was described as copying the result 
from the top of the stack to the destination register and then incrementing the 
stack pointer. So, if we had an instruction of the form popq REG, it would be 
equivalent to the code sequence


movq (%rsp), REG Read REG from stack
addq $8,%rsp Increment stack pointer


A. In light of analysis done in Practice Problem 4.8, does this code sequence
correctly describe the behavior of the instruction popq %rsp? Explain.


B. How could you rewrite the code sequence so that it correctly describes both 
the cases where REG is %rsp as well as any other register? 


4.47 ◆◆◆

Your assignment will be to write a Y86-64 program to perform bubblesort. For
reference, the following C function implements bubblesort using array referencing:


1  /* Bubble sort: Array version */
2  void bubble_a(long *data, long count) {
3      long i, last;
4      for (last = count-1; last > 0; last--) {
5          for (i = 0; i < last; i++)
6              if (data[i+1] < data[i]) {
7                  /* Swap adjacent elements */
8                  long t = data[i+1];
9                  data[i+1] = data[i];
10                 data[i] = t;
11             }
12     }
13 }


A. Write and test a C version that references the array elements with pointers, 
rather than using array indexing.


B. Write and test a Y86-64 program consisting of the function and test code. 
You may find it useful to pattern your implementation after x86-64 code 
generated by compiling your C code. Although pointer comparisons are
normally done using unsigned arithmetic, you can use signed arithmetic for 
this exercise. 


4.48 ◆◆

Modify the code you wrote for Problem 4.47 to implement the test and swap in
the bubblesort function (lines 6-11) using no jumps and at most three conditional
moves.


4.49 ◆◆◆
Modify the code you wrote for Problem 4.47 to implement the test and swap in the
bubblesort function (lines 6-11) using no jumps and just one conditional move.


4.50 ◆◆◆
In Section 3.6.8, we saw that a common way to implement switch statements is to
create a set of code blocks and then index those blocks using a jump table. Consider

#include <stdio.h>
/* Example use of switch statement */

long switchv(long idx) {
    long result = 0;
    switch(idx) {
    case 0:
        result = 0xaaa;
        break;
    case 2:
    case 5:
        result = 0xbbb;
        break;
    case 3:
        result = 0xccc;
        break;
    default:
        result = 0xddd;
    }
    return result;
}

/* Testing Code */
#define CNT 8
#define MINVAL -1

int main() {
    long vals[CNT];
    long i;
    for (i = 0; i < CNT; i++) {
        vals[i] = switchv(i + MINVAL);
        printf("idx = %ld, val = 0x%lx\n", i + MINVAL, vals[i]);
    }
    return 0;
}


Figure 4.69 Switch statements can be translated into Y86-64 code. This requires 
implementation of a jump table. 


the C code shown in Figure 4.69 for a function switchv, along with associated 
test code. 

Implement switchv in Y86-64 using a jump table. Although the Y86-64
instruction set does not include an indirect jump instruction, you can get the same
effect by pushing a computed address onto the stack and then executing the ret
instruction. Implement test code similar to what is shown in C to demonstrate that
your implementation of switchv will handle both the cases handled explicitly as
well as those that trigger the default case.


4.51 ◆

Practice Problem 4.3 introduced the iaddq instruction to add immediate data to a 
register. Describe the computations performed to implement this instruction. Use 
the computations for irmovq and OPq (Figure 4.18) as a guide. 


4.52 ◆◆

The file seq-full.hcl contains the HCL description for SEQ, along with the
declaration of a constant IIADDQ having hexadecimal value C, the instruction code
for iaddq. Modify the HCL descriptions of the control logic blocks to implement
the iaddq instruction, as described in Practice Problem 4.3 and Problem 4.51. See
the lab material for directions on how to generate a simulator for your solution
and how to test it.
4.53 ◆◆
Suppose we wanted to create a lower-cost pipelined processor based on the
structure we devised for PIPE- (Figure 4.41), without any bypassing. This design
would handle all data dependencies by stalling until the instruction generating a
needed value has passed through the write-back stage.

The file pipe-stall.hcl contains a modified version of the HCL code for
PIPE in which the bypassing logic has been disabled. That is, the signals e_valA
and e_valB are simply declared as follows:


## DO NOT MODIFY THE FOLLOWING CODE.
## No forwarding. valA is either valP or value from register file
word d_valA = [
    D_icode in { ICALL, IJXX } : D_valP; # Use incremented PC
    1 : d_rvalA; # Use value read from register file
];

## No forwarding. valB is value from register file
word d_valB = d_rvalB;


Modify the pipeline control logic at the end of this file so that it correctly han- 
dles all possible control and data hazards. As part of your design effort, you should 
analyze the different combinations of control cases, as we did in the design of the
pipeline control logic for PIPE. You will find that many different combinations 
can occur, since many more conditions require the pipeline to stall. Make sure 
your control logic handles each combination correctly. See the lab material for 
directions on how to generate a simulator for your solution and how to test it. 


4.54 ◆◆

The file pipe-full.hcl contains a copy of the PIPE HCL description, along with a
declaration of the constant value IIADDQ. Modify this file to implement the iaddq
instruction, as described in Practice Problem 4.3 and Problem 4.51. See the lab
material for directions on how to generate a simulator for your solution and how
to test it.


4.55 ◆◆

The file pipe-nt.hcl contains a copy of the HCL code for PIPE, plus a declaration
of the constant J_YES with value 0, the function code for an unconditional jump
instruction. Modify the branch prediction logic so that it predicts conditional
jumps as being not taken while continuing to predict unconditional jumps and
call as being taken. You will need to devise a way to get valC, the jump target
address, to pipeline register M to recover from mispredicted branches. See the lab
material for directions on how to generate a simulator for your solution and how
to test it.
4.56
The file pipe-btfnt.hcl contains a copy of the HCL code for PIPE, plus a
declaration of the constant J_YES with value 0, the function code for an
unconditional jump instruction. Modify the branch prediction logic so that it
predicts conditional jumps as being taken when valC < valP (backward branch) and
as being not taken when valC ≥ valP (forward branch). (Since Y86-64 does not support
unsigned arithmetic, you should implement this test using a signed comparison.)
Continue to predict unconditional jumps and call as being taken. You will need
to devise a way to get both valC and valP to pipeline register M to recover from
mispredicted branches. See the lab material for directions on how to generate a
simulator for your solution and how to test it.
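
As an illustration only, the prediction rule stated in this problem can be
sketched in C. The function name btfnt_predict_pc is hypothetical; the numeric
instruction codes are the usual Y86-64 encodings for jXX and call, and J_YES is
the function code named in the problem. The sketch covers only the choice of
predicted PC. It says nothing about the HCL changes or the misprediction
recovery that the problem asks for.

```c
#include <stdint.h>

/* Y86-64 instruction codes for jXX and call, and the function code
   of an unconditional jump (J_YES in the problem statement). */
enum { IJXX = 0x7, ICALL = 0x8, J_YES = 0x0 };

/* Hypothetical helper: predicted next PC under a backward-taken,
   forward-not-taken policy. */
int64_t btfnt_predict_pc(int icode, int ifun, int64_t valC, int64_t valP) {
    if (icode == ICALL)
        return valC;                        /* call: always taken */
    if (icode == IJXX) {
        if (ifun == J_YES)
            return valC;                    /* unconditional jump: taken */
        return (valC < valP) ? valC : valP; /* signed comparison */
    }
    return valP;                            /* any other instruction */
}
```

For example, a conditional jump whose target lies below the incremented PC is
predicted taken, while one whose target lies above it is predicted not taken.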
4.57
In our design of PIPE, we generate a stall whenever one instruction performs a
load, reading a value from memory into a register, and the next instruction has this
register as a source operand. When the source gets used in the execute stage, this
stalling is the only way to avoid a hazard. For cases where the second instruction
stores the source operand to memory, such as with an rmmovq or pushq instruction,
this stalling is not necessary. Consider the following code examples:


1   mrmovq 0(%rcx),%rdx    # Load 1
2   pushq %rdx             # Store 1
3   nop
4   popq %rdx              # Load 2
5   rmmovq %rax,0(%rdx)    # Store 2


In lines 1 and 2, the mrmovq instruction reads a value from memory into
%rdx, and the pushq instruction then pushes this value onto the stack. Our design
for PIPE would stall the pushq instruction to avoid a load/use hazard. Observe,
however, that the value of %rdx is not required by the pushq instruction until it
reaches the memory stage. We can add an additional bypass path, as diagrammed
in Figure 4.70, to forward the memory output (signal m_valM) to the valA field
in pipeline register M. On the next clock cycle, this forwarded value can then be
written to memory. This technique is known as load forwarding.

Note that the second example (lines 4 and 5) in the code sequence above
cannot make use of load forwarding. The value loaded by the popq instruction is
used as part of the address computation by the next instruction, and this value is
required in the execute stage rather than the memory stage.

Figure 4.70 Execute and memory stages capable of load forwarding. By adding a
bypass path from the memory output to the source of valA in pipeline register M, we can
use forwarding rather than stalling for one form of load/use hazard. This is the subject
of Problem 4.57.


A. Write a logic formula describing the detection condition for a load/use haz- 
ard, similar to the one given in Figure 4.64, except that it will not cause a 
stall in cases where load forwarding can be used. 


B. The file pipe-1f.hcl contains a modified version of the control logic for 
PIPE. It contains the definition of a signal e_valA to implement the block 
labeled "Fwd A" in Figure 4.70. It also has the conditions for a load/use
hazard in the pipeline control logic set to zero, and so the pipeline control logic
will not detect any forms of load/use hazards. Modify this HCL description 
to implement load forwarding. See the lab material for directions on how to 
generate a simulator for your solution and how to test it. 
4.58
Our pipelined design is a bit unrealistic in that we have two write ports for the
register file, but only the popq instruction requires two simultaneous writes to the
register file. The other instructions could therefore use a single write port, sharing
this for writing valE and valM. The following figure shows a modified version
of the write-back logic, in which we merge the write-back register IDs (W_dstE
and W_dstM) into a single signal w_dstE and the write-back values (W_valE and
W_valM) into a single signal w_valE:


The logic for performing the merges is written in HCL as follows: 


## Set E port register ID
word w_dstE = [
    ## writing from valM
    W_dstM != RNONE : W_dstM;
    1 : W_dstE;
];

## Set E port value
word w_valE = [
    W_dstM != RNONE : W_valM;
    1 : W_valE;
];


The control for these multiplexors is determined by dstE: when it indicates
there is some register, then it selects the value for port E, and otherwise it selects
the value for port M.
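
The merge can also be modeled in C, which may help when checking the HCL by
hand. The following is a minimal sketch that mirrors the two HCL case
expressions as written, selecting the M-port values whenever W_dstM names a
register. The type write_port_t and the function merge_write_ports are
hypothetical names introduced for this illustration, and RNONE is taken to be
the usual Y86-64 value 0xF.

```c
#include <stdint.h>

enum { RNONE = 0xF };   /* Y86-64 ID meaning "no register" */

/* Hypothetical record for the single merged write port */
typedef struct {
    int dst;        /* corresponds to w_dstE */
    int64_t val;    /* corresponds to w_valE */
} write_port_t;

/* Mirror of the HCL: prefer the M-port values when W_dstM is a register */
write_port_t merge_write_ports(int W_dstE, int64_t W_valE,
                               int W_dstM, int64_t W_valM) {
    write_port_t w;
    if (W_dstM != RNONE) {
        w.dst = W_dstM;
        w.val = W_valM;
    } else {
        w.dst = W_dstE;
        w.val = W_valE;
    }
    return w;
}
```

For every instruction other than popq, at most one of the two destinations is
a real register, so the merged port carries whichever write is actually needed.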

In the simulation model, we can then disable register port M, as shown by the 
following HCL code: 


## Disable register port M
## Set M port register ID
word w_dstM = RNONE;

## Set M port value
word w_valM = 0;


The challenge then becomes to devise a way to handle popq. One method is
to use the control logic to dynamically process the instruction popq rA so that it
has the same effect as the two-instruction sequence

    iaddq $8, %rsp
    mrmovq -8(%rsp), rA

(See Practice Problem 4.3 for a description of the iaddq instruction.) Note the
ordering of the two instructions to make sure popq %rsp works properly. You can
do this by having the logic in the decode stage treat popq the same as it would the
iaddq listed above, except that it predicts the next PC to be equal to the current
PC. On the next cycle, the popq instruction is refetched, but the instruction code
is converted to a special value IPOP2. This is treated as a special instruction that
has the same behavior as the mrmovq instruction listed above.

The file pipe-1w.hcl contains the modified write port logic described above.
It contains a declaration of the constant IPOP2 having hexadecimal value E. It
also contains the definition of a signal f_icode that generates the icode field for
pipeline register D. This definition can be modified to insert the instruction code
IPOP2 the second time the popq instruction is fetched. The HCL file also contains
a declaration of the signal f_pc, the value of the program counter generated in the
fetch stage by the block labeled "Select PC" (Figure 4.57).

Modify the control logic in this file to process popq instructions in the manner
we have described. See the lab material for directions on how to generate a
simulator for your solution and how to test it.





4.59 ◆
Compare the performance of the three versions of bubblesort (Problems 4.47,
4.48, and 4.49). Explain why one version performs better than the other.
Solution to Problem 4.1 (page 360)
Encoding instructions by hand is rather tedious, but it will solidify your
understanding of the idea that assembly code gets turned into byte sequences by the
assembler. In the following output from our Y86-64 assembler, each line shows
an address and a byte sequence that starts at that address:





1   0x100:                      | .pos 0x100  # Start code at address 0x100
2   0x100: 30f30f00000000000000 |     irmovq $15,%rbx
3   0x10a: 2031                 |     rrmovq %rbx,%rcx
4   0x10c:                      | loop:
5   0x10c: 4013fdffffffffffffff |     rmmovq %rcx,-3(%rbx)
6   0x116: 6031                 |     addq %rbx,%rcx
7   0x118: 700c01000000000000   |     jmp loop





Several features of this encoding are worth noting: 


* Decimal 15 (line 2) has hex representation 0x000000000000000f. Writing the
  bytes in reverse order gives 0f 00 00 00 00 00 00 00.

* Decimal -3 (line 5) has hex representation 0xfffffffffffffffd. Writing
  the bytes in reverse order gives fd ff ff ff ff ff ff ff.

* The code starts at address 0x100. The first instruction requires 10 bytes, while
  the second requires 2. Thus, the loop target will be 0x0000010c. Writing these
  bytes in reverse order gives 0c 01 00 00 00 00 00 00.
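
The byte reversal described in these notes is ordinary little-endian encoding
of an 8-byte value. As a minimal sketch, the helper below (encode_le64 is a
hypothetical name, not part of the Y86-64 tools) produces the same byte
sequences as the assembler does for the constants 15, -3, and 0x10c.

```c
#include <stdint.h>

/* Store a 64-bit value as 8 bytes, least significant byte first */
void encode_le64(int64_t value, unsigned char out[8]) {
    uint64_t u = (uint64_t) value;   /* two's-complement bit pattern */
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char) ((u >> (8 * i)) & 0xff);
}
```

Calling encode_le64(-3, buf) fills buf with fd ff ff ff ff ff ff ff, matching
the displacement bytes of the rmmovq instruction on line 5.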


Solution to Problem 4.2 (page 360) 

Decoding a byte sequence by hand helps you understand the task faced by a
processor. It must read byte sequences and determine what instructions are to
be executed. In the following, we show the assembly code used to generate each
of the byte sequences. To the left of the assembly code, you can see the address
and byte sequence for each instruction.


A. Some operations with immediate data and address displacements:

    0x100: 30f3fcffffffffffffff | irmovq $-4,%rbx
    0x10a: 40630008000000000000 | rmmovq %rsi,0x800(%rbx)
    0x114: 00                   | halt

B. Code including a function call:

    0x200: a06f                 |     pushq %rsi
    0x202: 800c02000000000000   |     call proc
    0x20b: 00                   |     halt
    0x20c:                      | proc:
    0x20c: 30f30a00000000000000 |     irmovq $10,%rbx
    0x216: 90                   |     ret

C. Code containing illegal instruction specifier byte 0xf0:

    0x300: 50540700000000000000 | mrmovq 7(%rsp),%rbp
    0x30a: 10                   | nop
    0x30b: f0                   | .byte 0xf0 # Invalid instruction code
    0x30c: b01f                 | popq %rcx

D. Code containing a jump operation:

    0x400:                      | loop:
    0x400: 6113                 |     subq %rcx,%rbx
    0x402: 730004000000000000   |     je loop
    0x40b: 00                   |     halt

E. Code containing an invalid second byte in a pushq instruction:

    0x500: 6362                 | xorq %rsi,%rdx
    0x502: a0                   | .byte 0xa0 # pushq instruction code
    0x503: f0                   | .byte 0xf0 # Invalid register specifier byte
Solution to Problem 4.3 (page 369)

Using the iaddq instruction, we can rewrite the sum function as

# long sum(long *start, long count)
# start in %rdi, count in %rsi
sum:
    xorq %rax,%rax          # sum = 0
    andq %rsi,%rsi          # Set condition codes
    jmp test
loop:
    mrmovq (%rdi),%r10      # Get *start
    addq %r10,%rax          # Add to sum
    iaddq $8,%rdi           # start++
    iaddq $-1,%rsi          # count--
test:
    jne loop                # Stop when 0
    ret


Solution to Problem 4.4 (page 370)
Gcc, running on an x86-64 machine, produces the following code for rsum:

long rsum(long *start, long count)
start in %rdi, count in %rsi
rsum:
    movl $0, %eax
    testq %rsi, %rsi
    jle .L9
    pushq %rbx
    movq (%rdi), %rbx
    subq $1, %rsi
    addq $8, %rdi
    call rsum
    addq %rbx, %rax
    popq %rbx
.L9:
    rep; ret

This can easily be adapted to produce Y86-64 code:


# long rsum(long *start, long count)
# start in %rdi, count in %rsi
rsum:
    xorq %rax,%rax          # Set return value to 0
    andq %rsi,%rsi          # Set condition codes
    je return               # If count == 0, return 0
    pushq %rbx              # Save callee-saved register
    mrmovq (%rdi),%rbx      # Get *start
    irmovq $-1,%r10
    addq %r10,%rsi          # count--
    irmovq $8,%r10
    addq %r10,%rdi          # start++
    call rsum
    addq %rbx,%rax          # Add *start to sum
    popq %rbx               # Restore callee-saved register
return:
    ret
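
For comparison, the computation performed by both assembly versions can be
written in C roughly as follows. This is a sketch reconstructed from the
assembly code rather than a quotation of the problem statement. Note that the
x86-64 code returns 0 whenever count <= 0 (jle), whereas the Y86-64 version
above tests only for count == 0 (je), so the two agree for nonnegative counts.

```c
/* Recursive sum of count elements starting at start */
long rsum(long *start, long count) {
    if (count <= 0)
        return 0;
    /* *start is saved across the call, as %rbx is in the assembly */
    return *start + rsum(start + 1, count - 1);
}
```

The value *start must survive the recursive call, which is why the assembly
code keeps it in the callee-saved register %rbx and saves that register on
the stack first.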


Solution to Problem 4.5 (page 370) 
This problem gives you a chance to try your hand at writing assembly code.

# long absSum(long *start, long count)
# start in %rdi, count in %rsi
absSum:
    irmovq $8,%r8           # Constant 8
    irmovq $1,%r9           # Constant 1
    xorq %rax,%rax          # sum = 0
    andq %rsi,%rsi          # Set condition codes
    jmp test
loop:
    mrmovq (%rdi),%r10      # x = *start
    xorq %r11,%r11          # Constant 0
    subq %r10,%r11          # -x
    jle pos                 # Skip if -x <= 0
    rrmovq %r11,%r10        # x = -x
pos:
    addq %r10,%rax          # Add to sum
    addq %r8,%rdi           # start++
    subq %r9,%rsi           # count--
test:
    jne loop                # Stop when 0
    ret


Solution to Problem 4.6 (page 370) 

This problem gives you a chance to try your hand at writing assembly code with 
conditional moves. We show only the code for the loop. The rest is the same as for 
Problem 4.5: 


loop:
    mrmovq (%rdi),%r10      # x = *start
    xorq %r11,%r11          # Constant 0
    subq %r10,%r11          # -x
    cmovg %r11,%r10         # If -x > 0 then x = -x
    addq %r10,%rax          # Add to sum
    addq %r8,%rdi           # start++
    subq %r9,%rsi           # count--
test:
    jne loop                # Stop when 0
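
To see what the loop computes, it can help to read it back into C. The
following is a sketch of the same logic, written by us for illustration: the
temporary neg plays the role of %r11, and the if statement corresponds to the
conditional move (or, in Problem 4.5, to the jle that skips the rrmovq). It
assumes no element equals the most negative long value, for which negation
would overflow in C.

```c
/* Sum of absolute values of count elements starting at start */
long absSum(long *start, long count) {
    long sum = 0;
    while (count != 0) {
        long x = *start;
        long neg = 0 - x;   /* xorq/subq: compute -x */
        if (neg > 0)
            x = neg;        /* cmovg: x = -x when -x > 0 */
        sum += x;
        start++;
        count--;
    }
    return sum;
}
```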


Solution to Problem 4.7 (page 370) 
Although it is hard to imagine any practical use for this particular instruction, it is 
important when designing a system to avoid any ambiguities in the specification. 
We want to determine a reasonable convention for the instruction’s behavior and 
to make sure each of our implementations adheres to this convention. 

The subq instruction in this test compares the starting value of %rsp to the
value pushed onto the stack. The fact that the result of this subtraction is zero
implies that the old value of %rsp gets pushed.


Solution to Problem 4.8 (page 371) 

It is even more difficult to imagine why anyone would want to pop to the stack 
pointer. Still, we should decide on a convention and stick with it. This code 
sequence pushes 0xabcd onto the stack, pops to %rsp, and returns the popped
value. Since the result equals 0xabcd, we can deduce that popq %rsp sets the stack
pointer to the value read from memory. It is therefore equivalent to the instruction
mrmovq (%rsp),%rsp.


Solution to Problem 4.9 (page 374) 
The EXCLUSIVE-OR function requires that the 2 bits have opposite values: 


bool xor = (!a && b) || (a && !b);


In general, the signals eq and xor will be complements of each other. That is,
one will equal 1 whenever the other is 0.
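
The complement relationship can be checked exhaustively, since there are only
four input combinations. The sketch below assumes the usual bit-equality
expression eq = (a && b) || (!a && !b); the function names are ours.

```c
#include <stdbool.h>

/* Bit equality, assumed to be (a && b) || (!a && !b) */
bool bit_eq(bool a, bool b) {
    return (a && b) || (!a && !b);
}

/* EXCLUSIVE-OR as given in the solution */
bool bit_xor(bool a, bool b) {
    return (!a && b) || (a && !b);
}

/* True if xor differs from eq on all four input combinations */
bool xor_is_complement_of_eq(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            if (bit_xor(a, b) == bit_eq(a, b))
                return false;
    return true;
}
```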


Solution to Problem 4.10 (page 377) 

The outputs of the EXCLUSIVE-OR circuits will be the complements of the bit
equality values. Using DeMorgan's laws (Web Aside DATA:BOOL on page 52), we can
implement AND using OR and NOT, yielding the circuit shown in Figure 4.71.


Solution to Problem 4.11 (page 379) 
We can see that the second part of the case expression can be written as


    B <= C : B;


Since the first line will detect the case where A is the minimum element, the second
line need only determine whether B or C is minimum.


Figure 4.71 Solution for Problem 4.10.


Solution to Problem 4.12 (page 380)
This design is a variant of the one to find the minimum of the three inputs:

word Med3 = [
    A <= B && B <= C : B;
    C <= B && B <= A : B;
    B <= A && A <= C : A;
    C <= A && A <= B : A;
    1 : C;
];
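
An HCL case expression selects the first case whose condition holds, so it
translates directly into a chain of if statements. The following C sketch of
the median computation (med3 is our name for it) can be used to test the logic
on sample inputs before committing it to HCL.

```c
/* Median of three values, following the case expression order */
long med3(long A, long B, long C) {
    if (A <= B && B <= C) return B;
    if (C <= B && B <= A) return B;
    if (B <= A && A <= C) return A;
    if (C <= A && A <= B) return A;
    return C;   /* default case: C lies between A and B */
}
```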

Solution to Problem 4.13 (page 387)
These exercises help make the stage computations more concrete. We can see from
the object code that this instruction is located at address 0x016. It consists of 10
bytes, with the first two being 0x30 and 0xf4. The last 8 bytes are a byte-reversed
version of 0x0000000000000080 (decimal 128).


Stage        Generic                   Specific
             irmovq V, rB              irmovq $128, %rsp

Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[0x016] = 3:0
             rA:rB ← M1[PC + 1]        rA:rB ← M1[0x017] = f:4
             valC ← M8[PC + 2]         valC ← M8[0x018] = 128
             valP ← PC + 10            valP ← 0x016 + 10 = 0x020

Decode

Execute      valE ← 0 + valC           valE ← 0 + 128 = 128

Memory

Write back   R[rB] ← valE              R[%rsp] ← valE = 128

PC update    PC ← valP                 PC ← valP = 0x020


This instruction sets register %rsp to 128 and increments the PC by 10. 
Solution to Problem 4.14 (page 390)

We can see that the instruction is located at address 0x02c and consists of 2 bytes
with values 0xb0 and 0x0f. Register %rsp was set to 120 by the pushq instruction
(line 6), which also stored 9 at this memory location.


Stage        Generic                   Specific
             popq rA                   popq %rax

Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[0x02c] = b:0
             rA:rB ← M1[PC + 1]        rA:rB ← M1[0x02d] = 0:f
             valP ← PC + 2             valP ← 0x02c + 2 = 0x02e

Decode       valA ← R[%rsp]            valA ← R[%rsp] = 120
             valB ← R[%rsp]            valB ← R[%rsp] = 120

Execute      valE ← valB + 8           valE ← 120 + 8 = 128

Memory       valM ← M8[valA]           valM ← M8[120] = 9

Write back   R[%rsp] ← valE            R[%rsp] ← 128
             R[rA] ← valM              R[%rax] ← 9

PC update    PC ← valP                 PC ← 0x02e

The instruction sets %rax to 9, sets %rsp to 128, and increments the PC by 2.
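
The stage computations for popq are simple enough to replay in a few lines of
C. The sketch below is only a hand-checking aid: the structure popq_trace_t and
the function trace_popq are hypothetical, and the memory read M8[valA] is
supplied by the caller rather than modeled.

```c
#include <stdint.h>

/* Hypothetical record of the values computed for one popq instruction */
typedef struct {
    int64_t valP;   /* incremented PC */
    int64_t valA;   /* old stack pointer (address to read) */
    int64_t valB;   /* old stack pointer (ALU input) */
    int64_t valE;   /* new stack pointer */
    int64_t valM;   /* value read from memory */
} popq_trace_t;

/* Replay the stages for popq, given PC, %rsp, and the 8-byte word at %rsp */
popq_trace_t trace_popq(int64_t pc, int64_t rsp, int64_t word_at_rsp) {
    popq_trace_t t;
    t.valP = pc + 2;        /* Fetch: popq is 2 bytes long */
    t.valA = rsp;           /* Decode */
    t.valB = rsp;
    t.valE = t.valB + 8;    /* Execute */
    t.valM = word_at_rsp;   /* Memory: M8[valA] */
    return t;               /* Write back: %rsp gets valE, rA gets valM */
}
```

With the values from this problem, trace_popq(0x02c, 120, 9) yields valE = 128,
valM = 9, and valP = 0x02e, matching the table.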


Solution to Problem 4.15 (page 391) 

Tracing the steps listed in Figure 4.20 with rA equal to %rsp, we can see that in
the memory stage the instruction will store valA, the original value of the stack
pointer, to memory, just as we found for x86-64.


Solution to Problem 4.16 (page 392)

Tracing the steps listed in Figure 4.20 with rA equal to %rsp, we can see that both
of the write-back operations will update %rsp. Since the one writing valM would
occur last, the net effect of the instruction will be to write the value read from
memory to %rsp, just as we saw for x86-64.


Solution to Problem 4.17 (page 393) 

Implementing conditional moves requires only minor changes from register-to- 
register moves. We simply condition the write-back step on the outcome of the 
conditional test: 1 


Stage        cmovXX rA, rB

Fetch        icode:ifun ← M1[PC]
             rA:rB ← M1[PC + 1]
             valP ← PC + 2

Decode       valA ← R[rA]

Execute      valE ← 0 + valA
             Cnd ← Cond(CC, ifun)

Memory

Write back   if (Cnd) R[rB] ← valE

PC update    PC ← valP

Solution to Problem 4.18 (page 394) 


We can see that this instruction is located at address 0x037 and is 9 bytes long.
The first byte has value 0x80, while the last 8 bytes are a byte-reversed version of
0x0000000000000041, the call target. The stack pointer was set to 128 by the popq
instruction (line 7).


Stage        Generic                   Specific
             call Dest                 call 0x041

Fetch        icode:ifun ← M1[PC]       icode:ifun ← M1[0x037] = 8:0
             valC ← M8[PC + 1]         valC ← M8[0x038] = 0x041
             valP ← PC + 9             valP ← 0x037 + 9 = 0x040

Decode       valB ← R[%rsp]            valB ← R[%rsp] = 128

Execute      valE ← valB + (-8)        valE ← 128 + (-8) = 120

Memory       M8[valE] ← valP           M8[120] ← 0x040

Write back   R[%rsp] ← valE            R[%rsp] ← 120

PC update    PC ← valC                 PC ← 0x041


The effect of this instruction is to set %rsp to 120, to store 0x040 (the return
address) at this memory address, and to set the PC to 0x041 (the call target).


Solution to Problem 4.19 (page 406)

All of the HCL code in this and other practice problems is straightforward, but
trying to generate it yourself will help you think about the different instructions
and how they are processed. For this problem, we can simply look at the set of
Y86-64 instructions (Figure 4.2) and determine which have a constant field.


bool need_valC =
    icode in { IIRMOVQ, IRMMOVQ, IMRMOVQ, IJXX, ICALL };
Solution to Problem 4.20 (page 407)
This code is similar to the code for srcA.


word srcB = [
    icode in { IOPQ, IRMMOVQ, IMRMOVQ } : rB;
    icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
    1 : RNONE; # Don't need register
];


Solution to Problem 4.21 (page 408) 
This code is similar to the code for dstE. 


word dstM = [
    icode in { IMRMOVQ, IPOPQ } : rA;
    1 : RNONE; # Don't write any register
];


Solution to Problem 4.22 (page 408) 

As we found in Practice Problem 4.16, we want the write via the M port to take
priority over the write via the E port in order to store the value read from memory
into %rsp.


Solution to Problem 4.23 (page 409) 


This code is similar to the code for aluA.


word aluB = [
    icode in { IRMMOVQ, IMRMOVQ, IOPQ, ICALL,
               IPUSHQ, IRET, IPOPQ } : valB;
    icode in { IRRMOVQ, IIRMOVQ } : 0;
    # Other instructions don't need ALU
];


Solution to Problem 4.24 (page 409) 

Implementing conditional moves is surprisingly simple: we disable writing to the
register file by setting the destination register to RNONE when the condition does
not hold.


word dstE = [
    icode in { IRRMOVQ } && Cnd : rB;
    icode in { IIRMOVQ, IOPQ } : rB;
    icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
    1 : RNONE; # Don't write any register
];


Solution to Problem 4.25 (page 410) 
This code is similar to the code for mem_addr.

word mem_data = [
    # Value from register
    icode in { IRMMOVQ, IPUSHQ } : valA;
    # Return PC
    icode == ICALL : valP;
    # Default: Don't write anything
];


Solution to Problem 4.26 (page 410) 
This code is similar to the code for mem_read:


bool mem_write = icode in { IRMMOVQ, IPUSHQ, ICALL };

Solution to Problem 4.27 (page 411) 
Computing the Stat field requires collecting status information from several stages:


## Determine instruction status
word Stat = [
    imem_error || dmem_error : SADR;
    !instr_valid : SINS;
    icode == IHALT : SHLT;
    1 : SAOK;
];


Solution to Problem 4.28 (page 417) 

This problem is an interesting exercise in trying to find the optimal balance among 
a set of partitions. It provides a number of opportunities to compute throughputs 
and latencies in pipelines. 


A. For a two-stage pipeline, the best partition would be to have blocks A, B,
   and C in the first stage and D, E, and F in the second. The first stage has a
   delay of 170 ps, giving a total cycle time of 170 + 20 = 190 ps. We therefore
   have a throughput of 5.26 GIPS and a latency of 380 ps.


B. For a three-stage pipeline, we should have blocks A and B in the first stage,
   blocks C and D in the second, and blocks E and F in the third. The first
   two stages have a delay of 110 ps, giving a total cycle time of 130 ps and a
   throughput of 7.69 GIPS. The latency is 390 ps.


C. For a four-stage pipeline, we should have block A in the first stage, blocks B
   and C in the second, block D in the third, and blocks E and F in the fourth.
   The second stage requires 90 ps, giving a total cycle time of 110 ps and a
   throughput of 9.09 GIPS. The latency is 440 ps.


D. The optimal design would be a five-stage pipeline, with each block in its
   own stage, except that the fifth stage has blocks E and F. The cycle time is
   80 + 20 = 100 ps, for a throughput of around 10.00 GIPS and a latency of
   500 ps. Adding more stages would not help, since we cannot run the pipeline
   any faster than one cycle every 100 ps.
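
The arithmetic behind these answers follows one pattern: the cycle time is the
delay of the slowest stage plus the register delay, the latency is the cycle
time multiplied by the number of stages, and the throughput in GIPS is 1,000
divided by the cycle time in picoseconds. The C sketch below captures that
pattern. The function names are ours, and the per-stage delays passed to it
must be worked out from the block delays given in the problem statement.

```c
/* Cycle time (ps): delay of the slowest stage plus the register delay */
int cycle_time_ps(const int *stage_delay, int nstages, int reg_delay) {
    int slowest = 0;
    for (int i = 0; i < nstages; i++)
        if (stage_delay[i] > slowest)
            slowest = stage_delay[i];
    return slowest + reg_delay;
}

/* Latency (ps): each instruction passes through every stage */
int latency_ps(int cycle_ps, int nstages) {
    return cycle_ps * nstages;
}

/* Throughput (GIPS): one instruction per cycle, 1,000 ps per ns */
double throughput_gips(int cycle_ps) {
    return 1000.0 / cycle_ps;
}
```

For part A, for instance, a slowest stage of 170 ps and a 20 ps register give
a cycle time of 190 ps and a latency of 2 × 190 = 380 ps.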







Solution to Problem 4.29 (page 418) 
Each stage would have combinational logic requiring 300/k ps and a pipeline
register requiring 20 ps.

A. The total latency would be 300 + 20k ps, while the throughput (in GIPS)
   would be

   \[ \frac{1{,}000}{\frac{300}{k} + 20} = \frac{1{,}000k}{300 + 20k} \]

B. As we let k go to infinity, the throughput becomes 1,000/20 = 50 GIPS. Of
   course, the latency would approach infinity as well.

This exercise quantifies the diminishing returns of deep pipelining. As we try to
subdivide the logic into many stages, the latency of the pipeline registers becomes
a limiting factor.
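
To get a feel for how quickly the returns diminish, the two formulas can be
evaluated for a few values of k. The following sketch simply encodes them in C;
the function names are ours.

```c
/* Throughput in GIPS of a k-stage pipeline: 1,000k / (300 + 20k) */
double throughput_gips_k(int k) {
    return 1000.0 * k / (300.0 + 20.0 * k);
}

/* Latency in ps of a k-stage pipeline: 300 + 20k */
int latency_ps_k(int k) {
    return 300 + 20 * k;
}
```

Going from k = 1 to k = 15 raises the throughput from 3.125 to 25 GIPS, half
of the 50 GIPS limit, while the latency grows from 320 to 600 ps. Every stage
added beyond that point buys less throughput than the one before it.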








Solution to Problem 4.30 (page 449) 
This code is very similar to the corresponding code for SEQ, except that we cannot
yet determine whether the data memory will generate an error signal for this
instruction.

# Determine status code for fetched instruction
word f_stat = [
    imem_error : SADR;
    !instr_valid : SINS;
    f_icode == IHALT : SHLT;
    1 : SAOK;
];


Solution to Problem 4.31 (page 449) 
This code simply involves prefixing the signal names in the code for SEQ with d_
and D_.

word d_dstE = [
    D_icode in { IRRMOVQ, IIRMOVQ, IOPQ } : D_rB;
    D_icode in { IPUSHQ, IPOPQ, ICALL, IRET } : RRSP;
    1 : RNONE; # Don't write any register
];


Solution to Problem 4.32 (page 452)

The rrmovq instruction (line 5) would stall for one cycle due to a load/use hazard
caused by the popq instruction (line 4). As it enters the decode stage, the popq
instruction would be in the memory stage, giving both M_dstE and M_dstM equal
to %rsp. If the two cases were reversed, then the write back from M_valE would
take priority, causing the incremented stack pointer to be passed as the argument
to the rrmovq instruction. This would not be consistent with the convention for
handling popq %rsp determined in Practice Problem 4.8.


Solution to Problem 4.33 (page 452) 
This problem lets you experience one of the important tasks in processor design— 
devising test programs for a new processor. In general, we should have test pro- 
grams that will exercise all of the different hazard possibilities and will generate 
incorrect results if some dependency is not handled properly. 

For this example, we can use a slightly modified version of the program shown
in Practice Problem 4.32:


    irmovq $5,%rdx
    irmovq $0x100,%rsp
    rmmovq %rdx,0(%rsp)
    popq %rsp
    nop
    nop
    rrmovq %rsp,%rax


The two nop instructions will cause the popq instruction to be in the write-back
stage when the rrmovq instruction is in the decode stage. If the two forwarding
sources in the write-back stage are given the wrong priority, then register %rax
will be set to the incremented stack pointer rather than the value read from
memory.


Solution to Problem 4.34 (page 453) 
This logic only needs to check the five forwarding sources: 


word d_valB = [
    d_srcB == e_dstE : e_valE; # Forward valE from execute
    d_srcB == M_dstM : m_valM; # Forward valM from memory
    d_srcB == M_dstE : M_valE; # Forward valE from memory
    d_srcB == W_dstM : W_valM; # Forward valM from write back
    d_srcB == W_dstE : W_valE; # Forward valE from write back
    1 : d_rvalB; # Use value read from register file
];
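
The order of the cases matters, because a case expression selects the first
matching source. A C sketch of the same selection makes that priority explicit.
The structure fwd_sources_t and the function select_valB are hypothetical names
introduced here; the fields carry the same names as the HCL signals.

```c
#include <stdint.h>

/* Hypothetical bundle of the five forwarding sources */
typedef struct {
    int e_dstE; int64_t e_valE;   /* execute stage */
    int M_dstM; int64_t m_valM;   /* memory stage, value read from memory */
    int M_dstE; int64_t M_valE;   /* memory stage, ALU result */
    int W_dstM; int64_t W_valM;   /* write-back stage, memory value */
    int W_dstE; int64_t W_valE;   /* write-back stage, ALU result */
} fwd_sources_t;

/* Select valB: the earliest pipeline stage wins, and within a stage
   the memory value takes priority over the ALU result */
int64_t select_valB(int d_srcB, int64_t d_rvalB, const fwd_sources_t *f) {
    if (d_srcB == f->e_dstE) return f->e_valE;
    if (d_srcB == f->M_dstM) return f->m_valM;
    if (d_srcB == f->M_dstE) return f->M_valE;
    if (d_srcB == f->W_dstM) return f->W_valM;
    if (d_srcB == f->W_dstE) return f->W_valE;
    return d_rvalB;               /* no match: use register file value */
}
```

When popq %rsp sits in the write-back stage, both W_dstM and W_dstE name %rsp,
and the ordering above forwards W_valM, the value read from memory.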


Solution to Problem 4.35 (page 454) 

This change would not handle the case where a conditional move fails to satisfy 
the condition, and therefore sets the dstE value to RNONE. The resulting value could 
get forwarded to the next instruction, even though the conditional transfer does 
not occur. 


    irmovq $0x123,%rax
    irmovq $0x321,%rdx
    xorq %rcx,%rcx          # CC = 100
    cmovne %rax,%rdx        # Not transferred
    addq %rdx,%rdx          # Should be 0x642
    halt

This code initializes register %rdx to 0x321. The conditional data transfer does
not take place, and so the final addq instruction should double the value in %rdx to
0x642. With the altered design, however, the conditional move source value 0x123
gets forwarded into ALU input valA, while input valB correctly gets operand value
0x321. These inputs get added to produce result 0x444.


Solution to Problem 4.36 (page 455) 
This code completes the computation of the status code for this instruction. 


## Update the status
word m_stat = [
    dmem_error : SADR;
    1 : M_stat;
];


Solution to Problem 4.37 (page 461) 
The following test program is designed to set up control combination A (Figure 
4.67) and detect whether something goes wrong: 


# Code to generate a combination of not-taken branch and ret
        irmovq Stack,%rsp
        irmovq rtnp,%rax
        pushq %rax          # Set up return pointer
        xorq %rax,%rax      # Set Z condition code
        jne target          # Not taken (First part of combination)
        irmovq $1,%rax      # Should execute this
        halt
target: ret                 # Second part of combination
        irmovq $2,%rbx      # Should not execute this
        halt
rtnp:   irmovq $3,%rdx      # Should not execute this
        halt
.pos 0x40
Stack:


This program is designed so that if something goes wrong (for example, if 
the ret instruction is actually executed), then the program will execute one of the 
extra irmovq instructions and then halt. Thus, an error in the pipeline would cause 
some register to be updated incorrectly. This code illustrates the care required to 
implement a test program. It must set up a potential error condition and then
detect whether or not an error occurs. 


Solution to Problem 4.38 (page 462) 

The following test program is designed to set up control combination B (Figure 
4.67). The simulator will detect a case where the bubble and stall control signals 
for a pipeline register are both set to zero, and so our test program need only set 
up the combination for it to be detected. The biggest challenge is to make the 
program do something sensible when handled correctly. 







# Test instruction that modifies %esp followed by ret 
irmovq mem, %4rbx 
mrmovq O(%rbx),irsp # Sets {rsp to point to return point 
ret # Returns to return point 
halt ' # 
rtnpt: irmovq $5,%rsi # Return point 
halt 
.pos 0x40 
mem: -quad stack # Holds desired stack pointer 
-pos 0x50 
stack: .quad rtnpt # Top of stack: Holds return point 

This program uses two initialized words in memory. The first word (mem) holds
the address of the second (stack, the desired stack pointer). The second word
holds the address of the desired return point for the ret instruction. The program 
loads the stack pointer into %rsp and executes the ret instruction. 


Solution to Problem 4.39 (page 463) 


From Figure 4.66, we can see that pipeline register D must be stalled for a load/use
hazard:

bool D_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVQ, IPOPQ } &&
    E_dstM in { d_srcA, d_srcB };


Solution to Problem 4.40 (page 464) 
From Figure 4.66, we can see that pipeline register E must be set to bubble for a 
load/use hazard or for a mispredicted branch: 


bool E_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Cnd) ||
    # Conditions for a load/use hazard
    E_icode in { IMRMOVQ, IPOPQ } &&
    E_dstM in { d_srcA, d_srcB };


Solution to Problem 4.41 (page 464) 
This control requires examining the code of the executing instruction and checking
for exceptions further down the pipeline.

## Should the condition codes be updated?
bool set_cc = E_icode == IOPQ &&
    # State changes only during normal operation
    !m_stat in { SADR, SINS, SHLT } && !W_stat in { SADR, SINS, SHLT };


Solution to Problem 4.42 (page 464)
Injecting a bubble into the memory stage on the next cycle involves checking for
an exception in either the memory or the write-back stage during the current cycle.

# Start injecting bubbles as soon as exception passes through memory stage
bool M_bubble = m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT };


For stalling the write-back stage, we check only the status of the instruction 
in this stage. If we also stalled when an excepting instruction was in the memory 
stage, then this instruction would not be able to enter the write-back stage. 


bool W_stall = W_stat in { SADR, SINS, SHLT };


Solution to Problem 4.43 (page 468) 

We would then have a misprediction frequency of 0.35, giving $mp = 0.20 \times 0.35 \times 2 = 0.14$,
giving an overall CPI of 1.25. This seems like a fairly marginal gain, but
it would be worthwhile if the cost of implementing the new branch prediction 
strategy were not too high. 
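
The arithmetic above can be checked with a short C sketch. The helper names `penalty` and `cpi` are ours, not part of the original solution. Only the values 0.20, 0.35, and 2 come from this solution; the load/use and return penalties used in the usage note (0.05 and 0.06) are assumed values, chosen to be consistent with the overall CPI of 1.25 stated above.

```c
/*
 * Penalty term in bubbles per instruction:
 * instruction frequency x condition frequency x bubbles per occurrence.
 */
double penalty(double instr_freq, double cond_freq, double bubbles)
{
    return instr_freq * cond_freq * bubbles;
}

/*
 * Overall CPI = 1.0 plus the three penalty terms:
 * lp (load/use), mp (mispredicted branch), rp (return).
 */
double cpi(double lp, double mp, double rp)
{
    return 1.0 + lp + mp + rp;
}
```

Under these assumptions, `penalty(0.20, 0.35, 2.0)` evaluates to approximately 0.14, and `cpi(0.05, 0.14, 0.06)` to approximately 1.25.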


Solution to Problem 4.44 (page 468) 

This simplified analysis, where we focus on the inner loop, is a useful way to 
estimate program performance. As long as the array is sufficiently large, the time 
spent in other parts of the code will be negligible. 


A. The inner loop of the code using the conditional jump has 11 instructions, all 
of which are executed when the array element is zero or negative, and 10 of 
which are executed when the array element is positive. The average is 10.5. 
The inner loop of the code using the conditional move has 10 instructions, 
all of which are executed every time. 


B. The loop-closing jump will be predicted correctly, except when the loop
terminates. For a very long array, this one misprediction will have a negligible 
effect on the performance. The only other source of bubbles for the jump- 
based code is the conditional jump, depending on whether or not the array 
element is positive. This will cause two bubbles, but it only occurs 50% of 
the time, so the average is 1.0. There are no bubbles in the conditional move 
code. 


C. Our conditional jump code requires an average of 10.5 + 1.0 = 11.5 cycles
per array element (11 cycles in the best case and 12 cycles in the worst), 
while our conditional move code requires 10.0 cycles in all cases. 


Our pipeline has a branch misprediction penalty of only two cycles—far better 
than those for the deep pipelines of higher-performance processors. As a result,
using conditional moves does not affect program performance very much. 
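
The averages in parts A through C can be reproduced with a small calculation. The function below is an illustrative sketch (its name and parameters are ours); it assumes, as the solution does, that the two paths through the loop body are equally likely.

```c
/*
 * Average cycles per element for a loop containing one data-dependent branch.
 *   instr_a, instr_b: instruction counts on the two paths (assumed equally likely)
 *   mispredict_rate:  fraction of iterations in which the branch is mispredicted
 *   bubbles:          bubbles injected per misprediction
 */
double avg_cycles(double instr_a, double instr_b,
                  double mispredict_rate, double bubbles)
{
    double avg_instr = (instr_a + instr_b) / 2.0;
    return avg_instr + mispredict_rate * bubbles;
}
```

For the jump-based code, `avg_cycles(11, 10, 0.5, 2)` gives 10.5 + 1.0 = 11.5. For the conditional-move code, both paths have 10 instructions and there are no mispredictions, so `avg_cycles(10, 10, 0.0, 2)` gives 10.0.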

The primary objective in writing a program must be to make it work correctly
under all possible conditions. A program that runs fast but gives incorrect
results serves no useful purpose. Programmers must write clear and concise code, 
not only so that they can make sense of it, but also so that others can read and 
understand the code during code reviews and when modifications are required 
later. 

On the other hand, there are many occasions when making a program run 
fast is also an important consideration. If a program must process video frames or 
network packets in real time, then a slow-running program will not provide the 
needed functionality. When a computational task is so demanding that it requires 
days or weeks to execute, then making it run just 20% faster can have significant 
impact. In this chapter, we will explore how to make programs run faster via
several different types of program optimization. 

Writing an efficient program requires several types of activities. First, we 
must select an appropriate set of algorithms and data structures. Second, we 
must write source code that the compiler can effectively optimize to turn into 
efficient executable code. For this second part, it is important to understand the 
capabilities and limitations of optimizing compilers. Seemingly minor changes in 
how a program is written can make large differences in how well a compiler can 
optimize it. Some programming languages are more easily optimized than others. 
Some features of C, such as the ability to perform pointer arithmetic and casting, 
make it challenging for a compiler to optimize. Programmers can often write their 
programs in ways that make it easier for compilers to generate efficient code. A 
third technique for dealing with especially demanding computations is to divide 
a task into portions that can be computed in parallel, on some combination of 
multiple cores and multiple processors. We will defer this aspect of performance 
enhancement to Chapter 12. Even when exploiting parallelism, it is important that 
each parallel thread execute with maximum performance, and so the material of 
this chapter remains relevant in any case. 

In approaching program development and optimization, we must consider 
how the code will be used and what critical factors affect it. In general, program- 
mers must make a trade-off between how easy a program is to implement and 
maintain, and how fast it runs. At an algorithmic level, a simple insertion sort can 
be programmed in a matter of minutes, whereas a highly efficient sort routine 
may take a day or more to implement and optimize. At the coding level, many 
low-level optimizations tend to reduce code readability and modularity, making 
the programs more susceptible to bugs and more difficult to modify or extend. 
For code that will be executed repeatedly in a performance-critical environment,
extensive optimization may be appropriate. One challenge is to maintain some 
degree of elegance and readability in the code despite extensive transformations. 

We describe a number of techniques for improving code performance. Ideally,
a compiler would be able to take whatever code we write and generate the most 
efficient possible machine-level program having the specified behavior. Modern 
compilers employ sophisticated forms of analysis and optimization, and they keep 
getting better. Even the best compilers, however, can be thwarted by optimization 
blockers, aspects of the program's behavior that depend strongly on the execution
environment. Programmers must assist the compiler by writing code that can
be optimized readily.

The first step in optimizing a program is to eliminate unnecessary work, making
the code perform its intended task as efficiently as possible. This includes
eliminating unnecessary function calls, conditional tests, and memory references. 
These optimizations do not depend on any specific properties of the target ma- 
chine. 

To maximize the performance of a program, both the programmer and the 
compiler require a model of the target machine, specifying how instructions are 
processed and the timing characteristics of the different operations. For example,
the compiler must know timing information to be able to decide whether it should 
use a multiply instruction or some combination of shifts and adds. Modern com- 
puters use sophisticated techniques to process a machine-level program, executing 
many instructions in parallel and possibly in a different order than they appear in 
the program. Programmers must understand how these processors work to be 
able to tune their programs for maximum speed. We present a high-level model 
of such a machine based on recent designs of Intel and AMD processors. We also 
devise a graphical data-flow notation to visualize the execution of instructions by 
the processor, with which we can predict program performance. 

With this understanding of processor operation, we can take a second step in 
program optimization, exploiting the capability of processors to provide instruc- 
tion-level parallelism, executing multiple instructions simultaneously. We cover 
several program transformations that reduce the data dependencies between dif- 
ferent parts of a computation, increasing the degree of parallelism with which they 
can be executed.

We conclude the chapter by discussing issues related to optimizing large pro- 
grams. We describe the use of code profilers—tools that measure the performance 
of different parts of a program. This analysis can help find inefficiencies in the code 
and identify the parts of the program on which we should focus our optimization 
efforts.

In this presentation, we make code optimization look like a simple linear 
process of applying a series of transformations to the code in a particular order.
In fact, the task is not nearly so straightforward. A fair amount of trial-and- 
error experimentation is required. This is especially true as we approach the later 
optimization stages, where seemingly small changes can cause major changes 
in performance and some very promising techniques prove ineffective. As we 
will see in the examples that follow, it can be difficult to explain exactly why a 
particular code sequence has a particular execution time. Performance can depend 
on many detailed features of the processor design for which we have relatively 
little documentation or understanding. This is another reason to try a number of 
different variations and combinations of techniques. 

Studying the assembly-code representation of a program is one of the most 
effective means for gaining an understanding of the compiler and how the gen- 
erated code will run. A good strategy is to start by looking carefully at the code 
for the inner loops, identifying performance-reducing attributes such as excessive
memory references and poor use of registers. Starting with the assembly code, we
can also predict what operations will be performed in parallel and how well they
will use the processor resources. As we will see, we can often determine the time
(or at least a lower bound on the time) required to execute a loop by identifying
critical paths, chains of data dependencies that form during repeated executions 
of a loop. We can then go back and modify the source code to try to steer the 
compiler toward more efficient implementations. 

Most major compilers, including GCC, are continually being updated and improved,
especially in terms of their optimization abilities. One useful strategy is to
do only as much rewriting of a program as is required to get it to the point where 
the compiler can then generate efficient code. By this means, we avoid compro- 
mising the readability, modularity, and portability of the code as much as if we had 
to work with a compiler of only minimal capabilities. Again, it helps to iteratively 
modify the code and analyze its performance both through measurements and by 
examining the generated assembly code. 

To novice programmers, it might seem strange to keep modifying the source 
code in an attempt to coax the compiler into generating efficient code, but this
is indeed how many high-performance programs are written. Compared to the
alternative of writing code in assembly language, this indirect approach has the 
advantage that the resulting code will still run on other machines, although per- 
haps not with peak performance. 


5.1 Capabilities and Limitations of Optimizing Compilers 


Modern compilers employ sophisticated algorithms to determine what values are 
computed in a program and how they are used. They can then exploit opportuni- 
ties to simplify expressions, to use a single computation in several different places, 
and to reduce the number of times a given computation must be performed. Most 
compilers, including GCC, provide users with some control over which optimizations
they apply. As discussed in Chapter 3, the simplest control is to specify the
optimization level. For example, invoking GCC with the command-line option -Og
specifies that it should apply a basic set of optimizations.

Invoking GCC with option -O1 or higher (e.g., -O2 or -O3) will cause it to apply
more extensive optimizations. These can further improve program performance,
but they may expand the program size and they may make the program more
difficult to debug using standard debugging tools. For our presentation, we will
mostly consider code compiled with optimization level -O1, even though level
-O2 has become the accepted standard for most software projects that use GCC.
We purposely limit the level of optimization to demonstrate how different ways
of writing a function in C can affect the efficiency of the code generated by a
compiler. We will find that we can write C code that, when compiled just with
option -O1, vastly outperforms a more naive version compiled with the highest
possible optimization levels.

Compilers must be careful to apply only safe optimizations to a program, 
meaning that the resulting program will have the exact same behavior as would 
an unoptimized version for all possible cases the program may encounter, up to 
the limits of the guarantees provided by the C language standards. Constraining 
the compiler to perform only safe optimizations eliminates possible sources of
undesired run-time behavior, but it also means that the programmer must make
more of an effort to write programs in a way that the compiler can then transform
into efficient machine-level code. To appreciate the challenges of deciding which
program transformations are safe or not, consider the following two procedures:


1  void twiddle1(long *xp, long *yp)
2  {
3      *xp += *yp;
4      *xp += *yp;
5  }
6
7  void twiddle2(long *xp, long *yp)
8  {
9      *xp += 2* *yp;
10 }


At first glance, both procedures seem to have identical behavior. They both 
add twice the value stored at the location designated by pointer yp to that desig- 
nated by pointer xp. On the other hand, function twiddle2 is more efficient. It 
requires only three memory references (read *xp, read *yp, write *xp), whereas 
twiddle1 requires six (two reads of *xp, two reads of *yp, and two writes of *xp).
Hence, if a compiler is given procedure twiddle1 to compile, one might think
it could generate more efficient code based on the computations performed by 
twiddle2. 

Consider, however, the case in which xp and yp are equal. Then function 
twiddle1 will perform the following computations: 


3 *xp += *xp; /* Double value at xp */ 
4 *xp += *xp; /* Double value at xp */ 


The result will be that the value at xp will be increased by a factor of 4. On the 
other hand, function twiddle2 will perform the following computation: 


9 *xp += 2* *xp; /* Triple value at xp */ 


The result will be that the value at xp will be increased by a factor of 3. The compiler
knows nothing about how twiddle1 will be called, and so it must assume that
arguments xp and yp can be equal. It therefore cannot generate code in the style
of twiddle2 as an optimized version of twiddle1.

The case where two pointers may designate the same memory location is 
known as memory aliasing. In performing only safe optimizations, the compiler 
must assume that different pointers may be aliased. As another example, for a 
program with pointer variables p and q, consider the following code sequence: 


x = 1000; y = 3000;
*q = y;   /* 3000 */
*p = x;   /* 1000 */
t1 = *q;  /* 1000 or 3000 */

The value computed for t1 depends on whether or not pointers p and q are
aliased: if not, it will equal 3,000, but if so it will equal 1,000. This leads to one
of the major optimization blockers, aspects of programs that can severely limit
the opportunities for a compiler to generate optimized code. If a compiler cannot
determine whether or not two pointers may be aliased, it must assume that either
case is possible, limiting the set of possible optimizations.
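
The effect of aliasing on twiddle1 and twiddle2 can be observed directly. The following sketch repeats the two procedures from the text and adds a small driver, `run`, which is a hypothetical helper of ours: it calls a procedure on local copies of x and y and, when `alias` is nonzero, passes the address of x for both pointer arguments.

```c
void twiddle1(long *xp, long *yp)
{
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(long *xp, long *yp)
{
    *xp += 2 * *yp;
}

/*
 * Hypothetical driver (not from the text): apply fn to x and y and
 * return the final value of x. If alias is nonzero, both pointer
 * arguments designate x, so xp and yp are aliased.
 */
long run(void (*fn)(long *, long *), long x, long y, int alias)
{
    fn(&x, alias ? &x : &y);
    return x;
}
```

With distinct locations, `run(twiddle1, 5, 3, 0)` and `run(twiddle2, 5, 3, 0)` both return 11. With aliased arguments, `run(twiddle1, 5, 3, 1)` returns 20 (the value is quadrupled) while `run(twiddle2, 5, 3, 1)` returns 15 (the value is tripled), which is why the compiler cannot substitute one for the other.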


Practice Problem 5.1
The following problem illustrates the way memory aliasing can cause unexpected
program behavior. Consider the following procedure to swap two values:

/* Swap value x at xp with value y at yp */
void swap(long *xp, long *yp)
{
    *xp = *xp + *yp; /* x+y */
    *yp = *xp - *yp; /* x+y-y = x */
    *xp = *xp - *yp; /* x+y-x = y */
}

If this procedure is called with xp equal to yp, what effect will it have?


A second optimization blocker is due to function calls. As an example, consider
the following two procedures:


long f();

long func1() {
    return f() + f() + f() + f();
}

long func2() {
    return 4*f();
}


It might seem at first that both compute the same result, but with func2 calling
f only once, whereas func1 calls it four times. It is tempting to generate code in
the style of func2 when given func1 as the source.

Consider, however, the following code for f: 


long counter = 0;

long f() {
    return counter++;
}


This function has a side effect: it modifies some part of the global program state.
Changing the number of times it gets called changes the program behavior. In
particular, a call to func1 would return 0 + 1 + 2 + 3 = 6, whereas a call to func2
would return 4 · 0 = 0, assuming both started with global variable counter set to
zero.

Aside Optimizing function calls by inline substitution

Code involving function calls can be optimized by a process known as inline substitution (or simply
"inlining"), where the function call is replaced by the code for the body of the function. For example,
we can expand the code for func1 by substituting four instantiations of function f:

/* Result of inlining f in func1 */
long func1in() {
    long t = counter++; /* +0 */
    t += counter++;     /* +1 */
    t += counter++;     /* +2 */
    t += counter++;     /* +3 */
    return t;
}

This transformation both reduces the overhead of the function calls and allows further optimization of
the expanded code. For example, the compiler can consolidate the updates of global variable counter
in func1in to generate an optimized version of the function:

/* Optimization of inlined code */
long func1opt() {
    long t = 4 * counter + 6;
    counter += 4;
    return t;
}

This code faithfully reproduces the behavior of func1 for this particular definition of function f.

Recent versions of GCC attempt this form of optimization, either when directed to with the
command-line option -finline or for optimization level -O1 and higher. Unfortunately, GCC only
attempts inlining for functions defined within a single file. That means it will not be applied in the
common case where a set of library functions is defined in one file but invoked by functions in other
files.

There are times when it is best to prevent a compiler from performing inline substitution. One
is when the code will be evaluated using a symbolic debugger, such as GDB, as described in Section
3.10.2. If a function call has been optimized away via inline substitution, then any attempt to trace or
set a breakpoint for that call will fail. The second is when evaluating the performance of a program
by profiling, as is discussed in Section 5.14.1. Calls to functions that have been eliminated by inline
substitution will not be profiled correctly.

Most compilers do not try to determine whether a function is free of side 
effects and hence is a candidate for optimizations such as those attempted in 
func2. Instead, the compiler assumes the worst case and leaves function calls 
intact. 
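
The difference between the two procedures can be checked with a short, self-contained sketch. It repeats f, func1, and func2 from the text, together with the consolidated form func1opt obtained by inlining, and adds `call_from`, a hypothetical helper of ours that sets counter to a chosen starting value before making the call.

```c
long counter = 0;

/* Function with a side effect: each call increments the global counter. */
long f(void)
{
    return counter++;
}

long func1(void)
{
    return f() + f() + f() + f();
}

long func2(void)
{
    return 4 * f();
}

/* Consolidated form of func1 after inlining f. */
long func1opt(void)
{
    long t = 4 * counter + 6;
    counter += 4;
    return t;
}

/* Hypothetical helper (not from the text): reset counter, then call fn. */
long call_from(long start, long (*fn)(void))
{
    counter = start;
    return fn();
}
```

Starting from zero, `call_from(0, func1)` returns 6 and `call_from(0, func2)` returns 0. Starting from 10, both `call_from(10, func1)` and `call_from(10, func1opt)` return 46 and leave counter at 14, illustrating that func1opt preserves the behavior of func1 for this definition of f while func2 does not.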

Among compilers, GCC is considered adequate, but not exceptional, in terms
of its optimization capabilities. It performs basic optimizations, but it does not
perform the radical transformations on programs that more “aggressive” compilers
do. As a consequence, programmers using GCC must put more effort into writing
programs in a way that simplifies the compiler's task of generating efficient code.


5.2 Expressing Program Performance 


We introduce the metric cycles per element, abbreviated CPE, to express program 
performance in a way that can guide us in improving the code. CPE measure- 
ments help us understand the loop performance of an iterative program at a 
detailed level. It is appropriate for programs that perform a repetitive compu- 
tation, such as processing the pixels in an image or computing the elements in a 
matrix product. 

The sequencing of activities by a processor is controlled by a clock providing 
a regular signal of some frequency, usually expressed in gigahertz (GHz), billions 
of cycles per second. For example, when product literature characterizes a system 
as a “4 GHz” processor, it means that the processor clock runs at $4.0 \times 10^9$ cycles
per second. The time required for each clock cycle is given by the reciprocal of
the clock frequency. These typically are expressed in nanoseconds (1 nanosecond
is $10^{-9}$ seconds) or picoseconds (1 picosecond is $10^{-12}$ seconds). For example,
the period of a 4 GHz clock can be expressed as either 0.25 nanoseconds or 250
picoseconds. From a programmer's perspective, it is more instructive to express 
measurements in clock cycles rather than nanoseconds or picoseconds. That way, 
the measurements express how many instructions are being executed rather than 
how fast the clock runs. 
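
The relationship between clock frequency, clock period, and elapsed time can be written as two one-line conversions. This is only a sketch; the function names are ours.

```c
/* Clock period in nanoseconds for a clock frequency given in GHz. */
double period_ns(double ghz)
{
    return 1.0 / ghz;
}

/* Elapsed time in nanoseconds for a given number of clock cycles. */
double cycles_to_ns(double cycles, double ghz)
{
    return cycles * period_ns(ghz);
}
```

For a 4 GHz clock, `period_ns(4.0)` is 0.25, matching the 0.25 nanoseconds (250 picoseconds) given above, and 1,000 cycles take `cycles_to_ns(1000.0, 4.0)`, or 250 nanoseconds. A cycle count stays the same when only the clock rate changes, which is why we prefer it as the unit of measurement.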

Many procedures contain a loop that iterates over a set of elements. For 
example, functions psum1 and psum2 in Figure 5.1 both compute the prefix sum
of a vector of length n. For a vector $a = \langle a_0, a_1, \ldots, a_{n-1} \rangle$, the prefix sum
$p = \langle p_0, p_1, \ldots, p_{n-1} \rangle$ is defined as

$$
\begin{aligned}
p_0 &= a_0 \\
p_i &= p_{i-1} + a_i, \quad 1 \le i < n
\end{aligned}
\tag{5.1}
$$

Function psum1 computes one element of the result vector per iteration. Func- 
tion psum2 uses a technique known as loop unrolling to compute two elements per 
iteration. We will explore the benefits of loop unrolling later in this chapter. (See 
Problems 5.11, 5.12, and 5.19 for more about analyzing and optimizing the prefix- 
sum computation.) 
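
Because psum2 handles two elements per iteration and needs a cleanup step for even n, it is worth confirming that it matches psum1 for both odd and even lengths. The sketch below repeats the two functions of Figure 5.1 and adds `psum_agree`, a hypothetical checker of ours that fills a small vector with the values 1, 2, ..., n and compares the two results element by element. The limit of 16 elements is an arbitrary choice for the illustration.

```c
/* Compute prefix sum of vector a, one element per iteration */
void psum1(float a[], float p[], long n)
{
    long i;
    p[0] = a[0];
    for (i = 1; i < n; i++)
        p[i] = p[i-1] + a[i];
}

/* Compute prefix sum of vector a, two elements per iteration */
void psum2(float a[], float p[], long n)
{
    long i;
    p[0] = a[0];
    for (i = 1; i < n-1; i += 2) {
        float mid_val = p[i-1] + a[i];
        p[i] = mid_val;
        p[i+1] = mid_val + a[i+1];
    }
    /* For even n, finish remaining element */
    if (i < n)
        p[i] = p[i-1] + a[i];
}

/*
 * Hypothetical checker (not from the text): return 1 if psum1 and psum2
 * agree on the vector 1, 2, ..., n and the last prefix sum equals
 * n(n+1)/2; return 0 otherwise. Requires 1 <= n <= 16.
 */
int psum_agree(long n)
{
    float a[16], p1[16], p2[16];
    long i;
    if (n < 1 || n > 16)
        return 0;
    for (i = 0; i < n; i++)
        a[i] = (float)(i + 1);
    psum1(a, p1, n);
    psum2(a, p2, n);
    for (i = 0; i < n; i++)
        if (p1[i] != p2[i])
            return 0;
    return p1[n-1] == (float)(n * (n + 1) / 2);
}
```

Calling `psum_agree` with n = 1, an even n such as 4, and an odd n such as 5 exercises the case with no loop iterations, the cleanup step, and the path where the loop covers every element.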

The time required by such a procedure can be characterized as a constant plus 
a factor proportional to the number of elements processed. For example, Figure 5.2 
shows a plot of the number of clock cycles required by the two functions for a 
range of values of n. Using a least squares fit, we find that the run times (in clock 
cycles) for psum1 and psum2 can be approximated by the equations 368 + 9.0n and 
368 + 6.0n, respectively. These equations indicate an overhead of 368 cycles due 
to the timing code and to initiate the procedure, set up the loop, and complete the
procedure, plus a linear factor of 6.0 or 9.0 cycles per element.

/* Compute prefix sum of vector a */
void psum1(float a[], float p[], long n)
{
    long i;
    p[0] = a[0];
    for (i = 1; i < n; i++)
        p[i] = p[i-1] + a[i];
}

void psum2(float a[], float p[], long n)
{
    long i;
    p[0] = a[0];
    for (i = 1; i < n-1; i+=2) {
        float mid_val = p[i-1] + a[i];
        p[i] = mid_val;
        p[i+1] = mid_val + a[i+1];
    }
    /* For even n, finish remaining element */
    if (i < n)
        p[i] = p[i-1] + a[i];
}

Figure 5.1 Prefix-sum functions. These functions provide examples for how we express
program performance.

Figure 5.2 Performance of prefix-sum functions. The slope of the lines indicates the
number of clock cycles per element (CPE).

Aside What is a least squares fit?

For a set of data points $(x_1, y_1), \ldots, (x_n, y_n)$, we often try to draw a line that best approximates the X-Y
trend represented by these data. With a least squares fit, we look for a line of the form $y = mx + b$
that minimizes the following error measure:

$$E(m, b) = \sum_{i=1,n} (m x_i + b - y_i)^2$$

An algorithm for computing m and b can be derived by finding the derivatives of $E(m, b)$ with respect
to m and b and setting them to 0.

For large values
of n (say, greater than 200), the run times will be dominated by the linear factors. 
We refer to the coefficients in these terms as the effective number of cycles per 
element. We prefer measuring the number of cycles per element rather than the 
number of cycles per iteration, because techniques such as loop unrolling allow us
to use fewer iterations to complete the computation, but our ultimate concern is 
how fast the procedure will run for a given vector length. We focus our efforts on 
minimizing the CPE for our computations. By this measure, psum2, with a CPE of 
6.0, is superior to psum1, with a CPE of 9.0.
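
To make the connection between a least squares fit and the CPE concrete, the following sketch fits a line to (n, cycles) measurements; the slope of the fitted line is the CPE and the intercept is the fixed overhead. The function `fit_line` uses the standard closed-form solution obtained by setting the derivatives of the squared error to zero. Both it and the demonstration helper `recover_cpe` are illustrative names of ours, and the sample lengths in the helper are arbitrary.

```c
/*
 * Fit y = m*x + b to n data points by least squares.
 * Returns 0 on success, or -1 if the fit is undefined
 * (fewer than two points, or all x values equal).
 */
int fit_line(const double x[], const double y[], long n, double *m, double *b)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, denom;
    long i;
    if (n < 2)
        return -1;
    for (i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    denom = n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;
    *m = (n * sxy - sx * sy) / denom;   /* slope: cycles per element */
    *b = (sy - *m * sx) / n;            /* intercept: fixed overhead */
    return 0;
}

/*
 * Hypothetical demonstration: generate exact run times from the model
 * cycles = overhead + cpe * n at a few vector lengths, then recover
 * the CPE as the slope of the fitted line.
 */
double recover_cpe(double overhead, double cpe)
{
    double n[4] = {100.0, 200.0, 400.0, 800.0};
    double cycles[4];
    double m = 0.0, b = 0.0;
    int i;
    for (i = 0; i < 4; i++)
        cycles[i] = overhead + cpe * n[i];
    fit_line(n, cycles, 4, &m, &b);
    return m;
}
```

With noise-free data generated from 368 + 9.0n, the fit returns a slope of 9.0; with real measurements the points scatter around the line and the fit gives the best linear approximation.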


Practice Problem 5.2 (solution page 573)

Later in this chapter we will start with a single function and generate many differ- 
ent variants that preserve the function’s behavior, but with different performance 
characteristics. For three of these variants, we found that the run times (in clock 
cycles) can be approximated by the following functions: 


Version 1: 60 + 35n 
Version 2: 136 + 4n 
Version 3: 157 + 1.25n 


For what values of n would each version be the fastest of the three? Remember 
that n will always be an integer. 




5.3 Program Example 


To demonstrate how an abstract program can be systematically transformed into
more efficient code, we will use a running example based on the vector data
structure shown in Figure 5.3. A vector is represented with two blocks of memory:
the header and the data array. The header is a structure declared as follows:

[Diagram: header with fields len and data]

Figure 5.3 Vector abstract data type. A vector is represented by header information
plus an array of designated length.

code/opt/vec.h
/* Create abstract data type for vector */
typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;
code/opt/vec.h


The declaration uses data_t to designate the data type of the underlying elements.
In our evaluation, we measured the performance of our code for integer (C int 
and long), and floating-point (C float and double) data. We do this by compiling 
and running the program separately for different type declarations, such as the 
following for data type long: 


typedef long data_t; 


We allocate the data array block to store the vector elements as an array of len 
objects of type data_t. 

Figure 5.4 shows some basic procedures for generating vectors, accessing vec- 
tor elements, and determining the length of a vector. An important feature to note 
is that get_vec_element, the vector access routine, performs bounds checking for 
every vector reference. This code is similar to the array representations used in
many other languages, including Java. Bounds checking reduces the chances of 
program error, but it can also slow down program execution. 

As an optimization example, consider the code shown in Figure 5.5, which 
combines all of the elements in a vector into a single value according to some 
operation. By using different definitions of compile-time constants IDENT and 
OP, the code can be recompiled to perform different operations on the data. In 
particular, using the declarations 


#define IDENT 0 
#define OP + 


it sums the elements of the vector. Using the declarations 


#define IDENT 1 
#define OP * 


it computes the product of the vector elements. 
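
The way IDENT and OP select the operation can be seen in a compact, self-contained sketch. To show both configurations in one file, it defines the macros, compiles one function, undefines them, and repeats with the other pair. The function names `combine_sum` and `combine_prod` and the use of a plain array in place of the vector type are simplifications of ours, not the book's code.

```c
#include <stddef.h>

typedef long data_t;

/* Configuration 1: sum of the elements */
#define IDENT 0
#define OP +
data_t combine_sum(const data_t *d, size_t n)
{
    data_t acc = IDENT;
    size_t i;
    for (i = 0; i < n; i++)
        acc = acc OP d[i];      /* expands to acc = acc + d[i] */
    return acc;
}
#undef IDENT
#undef OP

/* Configuration 2: product of the elements */
#define IDENT 1
#define OP *
data_t combine_prod(const data_t *d, size_t n)
{
    data_t acc = IDENT;
    size_t i;
    for (i = 0; i < n; i++)
        acc = acc OP d[i];      /* expands to acc = acc * d[i] */
    return acc;
}
#undef IDENT
#undef OP
```

For the elements 1, 2, 3, 4, `combine_sum` returns 10 and `combine_prod` returns 24. For an empty vector each returns its identity element, which is why IDENT must be 0 for addition and 1 for multiplication.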
In our presentation, we will proceed through a series of transformations of 
the code, writing different versions of the combining function.

code/opt/vec.c
/* Create vector of specified length */
vec_ptr new_vec(long len)
{
    /* Allocate header structure */
    vec_ptr result = (vec_ptr) malloc(sizeof(vec_rec));
    data_t *data = NULL;
    if (!result)
        return NULL;  /* Couldn't allocate storage */
    result->len = len;
    /* Allocate array */
    if (len > 0) {
        data = (data_t *)calloc(len, sizeof(data_t));
        if (!data) {
            free((void *) result);
            return NULL; /* Couldn't allocate storage */
        }
    }
    /* Data will either be NULL or allocated array */
    result->data = data;
    return result;
}

/*
 * Retrieve vector element and store at dest.
 * Return 0 (out of bounds) or 1 (successful)
 */
int get_vec_element(vec_ptr v, long index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return length of vector */
long vec_length(vec_ptr v)
{
    return v->len;
}
code/opt/vec.c

Figure 5.4 Implementation of vector abstract data type. In the actual program, data
type data_t is declared to be int, long, float, or double.

/* Implementation with maximum use of data abstraction */
void combine1(vec_ptr v, data_t *dest)
{
    long i;

    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Figure 5.5 Initial implementation of combining operation. Using different declarations
of identity element IDENT and combining operation OP, we can measure the
routine for different operations.

To gauge progress,
we measured the CPE performance of the functions on a machine with an Intel
Core i7 Haswell processor, which we refer to as our reference machine. Some 
characteristics of this processor were given in Section 3.1. These measurements 
characterize performance in terms of how the programs run on just one particular 
machine, and so there is no guarantee of comparable performance on other 
combinations of machine and compiler. However, we have compared the results 
with those for a number of different compiler/processor combinations, and we 
have found them generally consistent with those presented here. 

As we proceed through a set of transformations, we will find that many 
lead to only minimal performance gains, while others have more dramatic ef- 
fects. Determining which combinations of transformations to apply is indeed 
part of the “black art” of writing fast code. Some combinations that do not pro- 
vide measurable benefits are indeed ineffective, while others are important as 
ways to enable further optimizations by the compiler. In our experience, the 
best approach involves a combination of experimentation and analysis: repeat- 
edly attempting different approaches, performing measurements, and examining 
the assembly-code representations to identify underlying performance bottlenecks.

As a starting point, the following table shows CPE measurements for 
combine1 running on our reference machine, with different combinations of
operation (addition or multiplication) and data type (long integer and double- 
precision floating point). Our experiments with many different programs showed 
that operations on 32-bit and 64-bit integers have identical performance, with 
the exception of code involving division operations. Similarly, we found identical 
performance for programs operating on single- or double-precision floating-point 
data. In our tables, we will therefore show separate results only for integer data
and for floating-point data.

                                            Integer           Floating point
Function   Page   Method                    +       *         +       *
combine1   507    Abstract unoptimized      22.68   20.02     19.98   20.18
combine1   507    Abstract -O1              10.12   10.12     10.17   11.14


We can see that our measurements are somewhat imprecise. The more likely 
CPE number for integer sum is 23.00, rather than 22.68, while the number for 
integer product is likely 20.0 instead of 20.02. Rather than “fudging” our numbers 
to make them look good, we will present the measurements we actually obtained. 
There are many factors that complicate the task of reliably measuring the precise 
number of clock cycles required by some code sequence. It helps when examining 
these numbers to mentally round the results up or down by a few hundredths of 
a clock cycle. 

The unoptimized code provides a direct translation of the C code into machine
code, often with obvious inefficiencies. By simply giving the command-line option
-O1, we enable a basic set of optimizations. As can be seen, this significantly
improves the program performance (by roughly a factor of 2) with no effort
on behalf of the programmer. In general, it is good to get into the habit of
enabling some level of optimization. (Similar performance results were obtained
with optimization level -Og.) For the remainder of our measurements, we use
optimization levels -O1 and -O2 when generating and measuring our programs.


5.4 Eliminating Loop Inefficiencies 


Observe that procedure combine1, as shown in Figure 5.5, calls function
vec_length as the test condition of the for loop. Recall from our discussion of how
to translate code containing loops into machine-level programs (Section 3.6.7)
that the test condition must be evaluated on every iteration of the loop. On the
other hand, the length of the vector does not change as the loop proceeds. We
could therefore compute the vector length only once and use this value in our test
condition.

Figure 5.6 shows a modified version called combine2. It calls vec_length at
the beginning and assigns the result to a local variable length. This transformation
has a noticeable effect on the overall performance for some data types and operations,
and a minimal effect or even none for others. In any case, this transformation is
required to eliminate inefficiencies that would become bottlenecks as we attempt
further optimizations.





                                       Integer           Floating point
Function   Page   Method               +       *         +       *
combine1   507    Abstract -O1         10.12   10.12     10.17   11.14
combine2   509    Move vec_length      7.02    9.03      9.02    11.03


This optimization is an instance of a general class of optimizations known as
code motion. They involve identifying a computation that is performed multiple
times (e.g., within a loop), but such that the result of the computation will not
change. We can therefore move the computation to an earlier section of the code
that does not get evaluated as often. In this case, we moved the call to vec_length
from within the loop to just before the loop.


/* Move call to vec_length out of loop */
void combine2(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);

    *dest = IDENT;
    for (i = 0; i < length; i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}


Figure 5.6 Improving the efficiency of the loop test. By moving the call to
vec_length out of the loop test, we eliminate the need to execute it on every iteration.

Optimizing compilers attempt to perform code motion. Unfortunately, as discussed
previously, they are typically very cautious about making transformations
that change where or how many times a procedure is called. They cannot reliably
detect whether or not a function will have side effects, and so they assume that
it might. For example, if vec_length had some side effect, then combine1 and
combine2 could have different behaviors. To improve the code, the programmer
must often help the compiler by explicitly performing code motion.
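
The following is a minimal sketch of why a side effect blocks code motion. It is not the vector code from the figures: counted_length, call_count, and the two sum routines are hypothetical names introduced only for illustration, with a call counter standing in for any externally visible side effect.

```c
/* Hypothetical counter standing in for any externally visible side effect. */
static long call_count = 0;

/* A length function with a side effect: it records every call. */
static long counted_length(long n)
{
    call_count++;
    return n;
}

/* Helpers so a caller can observe the side effect. */
void reset_call_count(void) { call_count = 0; }
long get_call_count(void) { return call_count; }

/* Length evaluated in the loop test, in the style of combine1. */
long sum_test_in_loop(const long *a, long n)
{
    long i, sum = 0;
    for (i = 0; i < counted_length(n); i++)
        sum += a[i];
    return sum;
}

/* Length computed once before the loop, in the style of combine2. */
long sum_hoisted(const long *a, long n)
{
    long i, sum = 0;
    long length = counted_length(n);
    for (i = 0; i < length; i++)
        sum += a[i];
    return sum;
}
```

Both routines return the same sum, but they leave call_count in different states: the first calls counted_length once per evaluation of the loop test, the second exactly once. A compiler that cannot see inside the called function has to assume such a difference might matter.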

As an extreme example of the loop inefficiency seen in combine1, consider the
procedure lower1 shown in Figure 5.7. This procedure is styled after routines submitted
by several students as part of a network programming project. Its purpose
is to convert all of the uppercase letters in a string to lowercase. The procedure
steps through the string, converting each uppercase character to lowercase. The
case conversion involves shifting characters in the range ‘A’ to ‘Z’ to the range ‘a’
to ‘z’.

The library function strlen is called as part of the loop test of lower1. Although
strlen is typically implemented with special x86 string-processing instructions,
its overall execution is similar to the simple version that is also shown in
Figure 5.7. Since strings in C are null-terminated character sequences, strlen can
only determine the length of a string by stepping through the sequence until it
hits a null character. For a string of length n, strlen takes time proportional to n.
Since strlen is called in each of the n iterations of lower1, the overall run time
of lower1 is quadratic in the string length, proportional to n².


/* Convert string to lowercase: slow */
void lower1(char *s)
{
    long i;

    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Convert string to lowercase: faster */
void lower2(char *s)
{
    long i;
    long len = strlen(s);

    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Sample implementation of library function strlen */
/* Compute length of string */
size_t strlen(const char *s)
{
    long length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}


Figure 5.7 Lowercase conversion routines. The two procedures have radically different
performance.


This analysis is confirmed by actual measurements of the functions for strings of
different lengths, as shown in Figure 5.8 (and using the library version of strlen).
The graph of the run time for lower1 rises steeply as the string length increases
(Figure 5.8(a)). Figure 5.8(b) shows the run times for seven different lengths (not
the same as shown in the graph), each of which is a power of 2. Observe that for
lower1 each doubling of the string length causes a quadrupling of the run time.
This is a clear indicator of a quadratic run time. For a string of length 1,048,576,
lower1 requires over 17 minutes of CPU time.

(a) [Graph: run time in CPU seconds versus string length, 0 to 500,000]

                                        String length
Function   16,384   32,768   65,536   131,072   262,144   524,288   1,048,576
lower1     0.26     1.03     4.10     16.41     65.62     262.48    1,049.89
lower2     0.0000   0.0001   0.0001   0.0003    0.0005    0.0010    0.0020
(b)


Figure 5.8 Comparative performance of lowercase conversion routines. The original code lower1 has a 
quadratic run time due to an inefficient loop structure. The modified code lower2 has a linear run time. 


Function lower2 shown in Figure 5.7 is identical to lower1, except
that we have moved the call to strlen out of the loop. The performance improves
dramatically. For a string length of 1,048,576, the function requires just 2.0
milliseconds, over 500,000 times faster than lower1. Each doubling of the string
length causes a doubling of the run time, a clear indicator of linear run time. For
longer strings, the run-time improvement will be even greater.
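
The quadratic versus linear behavior can also be seen without a timer by counting work. The sketch below is an illustration only: counted_strlen, chars_examined, visits_lower1, and visits_lower2 are hypothetical names, and the counter simply tallies how many characters the length computation steps over.

```c
#include <stddef.h>

/* Hypothetical instrumentation: characters stepped over by counted_strlen. */
static long chars_examined = 0;

/* Same loop as the sample strlen in Figure 5.7, plus the counter. */
static size_t counted_strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
        chars_examined++;
    }
    return length;
}

/* lower1 structure: the length is recomputed in every loop test. */
long visits_lower1(char *s)
{
    size_t i;
    chars_examined = 0;
    for (i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return chars_examined;
}

/* lower2 structure: the length is computed once before the loop. */
long visits_lower2(char *s)
{
    size_t i, len;
    chars_examined = 0;
    len = counted_strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return chars_examined;
}
```

For a string of length n, the loop test in the first routine runs n + 1 times and each run steps over n characters, so the count grows as n(n + 1). The second routine steps over n characters in total.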

In an ideal world, a compiler would recognize that each call to strlen in 
the loop test will return the same result, and thus the call could be moved out of
the loop. This would require a very sophisticated analysis, since strlen checks 
the elements of the string and these values are changing as lower1 proceeds. The 
compiler would need to detect that even though the characters within the string are 
changing, none are being set from nonzero to zero, or vice versa. Such an analysis 
is well beyond the ability of even the most sophisticated compilers, even if they 
employ inlining, and so programmers must do such transformations themselves. 
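
To see why the compiler must be cautious, consider a loop that does change a character from nonzero to zero. The sketch below is a hypothetical example, not taken from the original text; cut_recompute and cut_hoisted are illustrative names. Each routine replaces spaces with null characters and returns the number of iterations it executed.

```c
#include <string.h>

/* Length is reread in every loop test, so it shrinks after the first write. */
long cut_recompute(char *s)
{
    long i, iterations = 0;
    for (i = 0; i < (long) strlen(s); i++) {
        iterations++;
        if (s[i] == ' ')
            s[i] = '\0';    /* nonzero to zero: strlen(s) now returns i */
    }
    return iterations;
}

/* Length is computed once, so the loop keeps going past the new terminator. */
long cut_hoisted(char *s)
{
    long i, iterations = 0;
    long len = (long) strlen(s);
    for (i = 0; i < len; i++) {
        iterations++;
        if (s[i] == ' ')
            s[i] = '\0';
    }
    return iterations;
}
```

On the same input the two routines run a different number of iterations and leave different bytes in the buffer, so moving the strlen call out of this loop would change the program's behavior. In lower1 the move is safe only because no character ever changes between zero and nonzero.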

This example illustrates a common problem in writing programs, in which a 
seemingly trivial piece of code has a hidden asymptotic inefficiency. One would 
not expect a lowercase conversion routine to be a limiting factor in a program's 
performance. Typically, programs are tested and analyzed on small data sets, for 
which the performance of lower1 is adequate. When the program is ultimately
deployed, however, it is entirely possible that the procedure could be applied to
strings of over one million characters. All of a sudden this benign piece of code 
has become a major performance bottleneck. By contrast, the performance of 
lower2 will be adequate for strings of arbitrary length. Stories abound of major 
programming projects in which problems of this sort occur. Part of the job of a 
competent programmer is to avoid ever introducing such asymptotic inefficiency. 


long min(long x, long y) { return x < y ? x : y; }
long max(long x, long y) { return x < y ? y : x; }
void incr(long *xp, long v) { *xp += v; }
long square(long x) { return x*x; }


The following three code fragments call these functions:

A. for (i = min(x, y); i < max(x, y); incr(&i, 1))
       t += square(i);

B. for (i = max(x, y) - 1; i >= min(x, y); incr(&i, -1))
       t += square(i);

C. long low = min(x, y);
   long high = max(x, y);

   for (i = low; i < high; incr(&i, 1))
       t += square(i);


Assume x equals 10 and y equals 100. Fill in the following table indicating the
number of times each of the four functions is called in code fragments A-C:


Code    min    max    incr    square
A.
B.
C.
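
One way to check an answer to this exercise is to instrument the four functions with call counters and run each fragment. The sketch below is an assumption-laden harness, not part of the original problem: the counters, reset_counts, get_counts, and the fragment_a, fragment_b, and fragment_c wrappers are hypothetical names added for illustration.

```c
/* Hypothetical call counters, one per function. */
static long n_min, n_max, n_incr, n_square;

long min(long x, long y) { n_min++; return x < y ? x : y; }
long max(long x, long y) { n_max++; return x < y ? y : x; }
void incr(long *xp, long v) { n_incr++; *xp += v; }
long square(long x) { n_square++; return x*x; }

void reset_counts(void) { n_min = n_max = n_incr = n_square = 0; }

/* Copy the counts into out[0..3] in the order min, max, incr, square. */
void get_counts(long out[4])
{
    out[0] = n_min;
    out[1] = n_max;
    out[2] = n_incr;
    out[3] = n_square;
}

/* Fragment A: both bounds appear in the loop header. */
long fragment_a(long x, long y)
{
    long i, t = 0;
    reset_counts();
    for (i = min(x, y); i < max(x, y); incr(&i, 1))
        t += square(i);
    return t;
}

/* Fragment B: counts downward, with min in the loop test. */
long fragment_b(long x, long y)
{
    long i, t = 0;
    reset_counts();
    for (i = max(x, y) - 1; i >= min(x, y); incr(&i, -1))
        t += square(i);
    return t;
}

/* Fragment C: both bounds computed once before the loop. */
long fragment_c(long x, long y)
{
    long i, t = 0;
    long low, high;
    reset_counts();
    low = min(x, y);
    high = max(x, y);
    for (i = low; i < high; incr(&i, 1))
        t += square(i);
    return t;
}
```

Calling fragment_a(10, 100) and then get_counts fills an array with the four call counts for fragment A, and likewise for the other two fragments.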


5.5 Reducing Procedure Calls 


As we have seen, procedure calls can incur overhead and also block most forms of
program optimization. We can see in the code for combine2 (Figure 5.6) that
get_vec_element is called on every loop iteration to retrieve the next vector element.
This function checks the vector index i against the loop bounds with every vector
reference, a clear source of inefficiency. Bounds checking might be a useful feature
when dealing with arbitrary array accesses, but a simple analysis of the code for
combine2 shows that all references will be valid.


code/opt/vec.c
data_t *get_vec_start(vec_ptr v)
{
    return v->data;
}
code/opt/vec.c


/* Direct access to vector data */
void combine3(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);

    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OP data[i];
    }
}

Figure 5.9 Eliminating function calls within the loop. The resulting code does not
show a performance gain, but it enables additional optimizations.


Suppose instead that we add a function get_vec_start to our abstract data
type. This function returns the starting address of the data array, as shown in
Figure 5.9. We could then write the procedure shown as combine3 in this figure,
having no function calls in the inner loop. Rather than making a function call to 
retrieve each vector element, it accesses the array directly. A purist might say that 
this transformation seriously impairs the program modularity. In principle, the 
user of the vector abstract data type should not even need to know that the vector 
contents are stored as an array, rather than as some other data structure such as a 
linked list. A more pragmatic programmer would argue that this transformation 
is a necessary step toward achieving high-performance results. 


                                         Integer          Floating point
Function   Page   Method                 +      *         +      *
combine2   509    Move vec_length        7.02   9.03      9.02   11.03
combine3   513    Direct data access     7.17   9.02      9.02   11.03


Surprisingly, there is no apparent performance improvement. Indeed, the 
performance for integer sum has gotten slightly worse. Evidently, other operations 
in the inner loop are forming a bottleneck that limits the performance more 
than the call to get_vec_element. We will return to this function later (Section
5.11.2) and see why the repeated bounds checking by combine2 does not incur a
performance penalty. For now, we can view this transformation as one of a series
of steps that will ultimately lead to greatly improved performance.


5.6 Eliminating Unneeded Memory References


The code for combine3 accumulates the value being computed by the combining
operation at the location designated by the pointer dest. This attribute can be seen 
by examining the assembly code generated for the inner loop of the compiled 
code. We show here the x86-64 code generated for data type double and with 
multiplication as the combining operation: 


Inner loop of combine3. data_t = double, OP = *
dest in %rbx, data+i in %rdx, data+length in %rax
.L17:                                loop:
  vmovsd  (%rbx), %xmm0                Read product from dest
  vmulsd  (%rdx), %xmm0, %xmm0         Multiply product by data[i]
  vmovsd  %xmm0, (%rbx)                Store product at dest
  addq    $8, %rdx                     Increment data+i
  cmpq    %rax, %rdx                   Compare to data+length
  jne     .L17                         If !=, goto loop


We see in this loop code that the address corresponding to pointer dest is held in
register %rbx. The compiler has also transformed the code to maintain a pointer to
the ith data element in register %rdx, shown in the annotations as data+i. This
pointer is incremented by 8 on every iteration. The loop termination is detected by
comparing this pointer to one stored in register %rax. We can see that the accumulated
value is read from and written to memory on each iteration. This reading and writing is
wasteful, since the value read from dest at the beginning of each iteration should
simply be the value written at the end of the previous iteration.
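
As a rough C-level picture of the loop just described, the sketch below writes the pointer-based form by hand for data type double and multiplication. It is an illustration under stated assumptions, not compiler output: product_via_dest, p, end, and t are hypothetical names, and 1.0 is used as the identity element.

```c
/* Hand-written C rendering of the combine3 inner loop for double and *. */
void product_via_dest(const double *data, long length, double *dest)
{
    const double *p = data;              /* plays the role of data+i */
    const double *end = data + length;   /* plays the role of data+length */

    *dest = 1.0;                         /* identity element for multiplication */
    while (p != end) {
        double t = *dest;                /* read product from dest */
        t = t * *p;                      /* multiply product by data[i] */
        *dest = t;                       /* store product at dest */
        p++;                             /* advance to the next element */
    }
}
```

Each trip through the loop touches memory three times: it reads from dest, reads one data element, and writes to dest.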

We can eliminate this needless reading and writing of memory by rewriting the
code in the style of combine4 in Figure 5.10. We introduce a temporary variable
acc that is used in the loop to accumulate the computed value. The result is stored
at dest only after the loop has been completed. As the assembly code that follows
shows, the compiler can now use register %xmm0 to hold the accumulated value.
Compared to the loop in combine3, we have reduced the memory operations per
iteration from two reads and one write to just a single read.


Inner loop of combine4. data_t = double, OP = *
acc in %xmm0, data+i in %rdx, data+length in %rax
.L25:                                loop:
  vmulsd  (%rdx), %xmm0, %xmm0         Multiply acc by data[i]
  addq    $8, %rdx                     Increment data+i
  cmpq    %rax, %rdx                   Compare to data+length
  jne     .L25                         If !=, goto loop


/* Accumulate result in local variable */
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    for (i = 0; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}


Figure 5.10 Accumulating result in temporary. Holding the accumulated value in local
variable acc (short for "accumulator") eliminates the need to retrieve it from memory
and write back the updated value on every loop iteration.


We see a significant improvement in program performance, as shown in the
following table:

                                              Integer          Floating point
Function   Page   Method                      +      *         +      *
combine3   513    Direct data access          7.17   9.02      9.02   11.03
combine4   515    Accumulate in temporary     1.27   3.01      3.01   5.01





All of our times improve by factors ranging from 2.2× to 5.7×, with the integer
addition case dropping to just 1.27 clock cycles per element. 

Again, one might think that a compiler should be able to automatically transform
the combine3 code shown in Figure 5.9 to accumulate the value in a register,
as it does with the code for combine4 shown in Figure 5.10. In fact, however, the 
two functions can have different behaviors due to memory aliasing. Consider, for 
example, the case of integer data with multiplication as the operation and 1 as the 
identity element. Let v = [2, 3, 5] be a vector of three elements and consider the 
following two function calls: 


combine3(v, get_vec_start(v) + 2); 
combine4(v, get_vec_start(v) + 2); 


That is, we create an alias between the last element of the vector and the destina- 
tion for storing the result. The two functions would then execute as follows: 


Function   Initial     Before loop   i = 0       i = 1       i = 2        Final
combine3   [2, 3, 5]   [2, 3, 1]     [2, 3, 2]   [2, 3, 6]   [2, 3, 36]   [2, 3, 36]
combine4   [2, 3, 5]   [2, 3, 5]     [2, 3, 5]   [2, 3, 5]   [2, 3, 5]    [2, 3, 30]


As shown previously, combine3 accumulates its result at the destination,
which in this case is the final vector element. This value is therefore set first to
1, then to 2 · 1 = 2, and then to 3 · 2 = 6. On the last iteration, this value is then
multiplied by itself to yield a final value of 36. For the case of combine4, the vector
remains unchanged until the end, when the final element is set to the computed
result 1 · 2 · 3 · 5 = 30.
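
The aliasing case can be reproduced with a small stand-alone sketch. It makes several assumptions for brevity: a plain array replaces the vector abstract data type, data_t is long, the operation is multiplication with identity 1, and combine3_array and combine4_array are hypothetical names that mirror the structure of combine3 and combine4.

```c
typedef long data_t;
#define IDENT 1
#define OP *

/* Accumulates at dest on every iteration, like combine3. */
void combine3_array(data_t *data, long length, data_t *dest)
{
    long i;
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OP data[i];
}

/* Accumulates in a local variable and stores once, like combine4. */
void combine4_array(data_t *data, long length, data_t *dest)
{
    long i;
    data_t acc = IDENT;
    for (i = 0; i < length; i++)
        acc = acc OP data[i];
    *dest = acc;
}
```

Calling each routine on its own copy of the array {2, 3, 5} with dest pointing at the last element leaves 36 in that element for the first routine and 30 for the second, matching the table.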

Of course, our example showing the distinction between combine3 and 
combine4 is highly contrived. One could argue that the behavior of combine4 
more closely matches the intention of the function description. Unfortunately, a 
compiler cannot make a judgment about the conditions under which a function 
might be used and what the programmer's intentions might be. Instead, when 
given combine3 to compile, the conservative approach is to keep reading and 
writing memory, even though this is less efficient.

When we use GCC to compile combine3 with command-line option -O2, we get
code with substantially better CPE performance than with -O1:


                                              Integer          Floating point
Function   Page   Method                      +      *         +      *
combine3   513    Compiled -O1                7.17   9.02      9.02   11.03
combine3   513    Compiled -O2                1.60   3.01      3.01   5.01
combine4   515    Accumulate in temporary     1.27   3.01      3.01   5.01


We achieve performance comparable to that for combine4, except for the case 
of integer sum, but even it improves significantly. On examining the assembly code 
generated by the compiler, we find an interesting variant for the inner loop: 


Inner loop of combine3. data_t = double, OP = *. Compiled -O2
dest in %rbx, data+i in %rdx, data+length in %rax
Accumulated product in %xmm0
.L22:                                loop:
  vmulsd  (%rdx), %xmm0, %xmm0         Multiply product by data[i]
  addq    $8, %rdx                     Increment data+i
  cmpq    %rax, %rdx                   Compare to data+length
  vmovsd  %xmm0, (%rbx)                Store product at dest
  jne     .L22                         If !=, goto loop


We can compare this to the version created with optimization level 1: 


    Inner loop of combine3. data_t = double, OP = *. Compiled -O1
    dest in %rbx, data+i in %rdx, data+length in %rax
1   .L17:                                loop:
2     vmovsd  (%rbx), %xmm0                Read product from dest
3     vmulsd  (%rdx), %xmm0, %xmm0         Multiply product by data[i]
4     vmovsd  %xmm0, (%rbx)                Store product at dest
5     addq    $8, %rdx                     Increment data+i
6     cmpq    %rax, %rdx                   Compare to data+length
7     jne     .L17                         If !=, goto loop


We see that, besides some reordering of instructions, the only difference is that 
the more optimized version does not contain the vmovsd implementing the read 
from the location designated by dest (line 2). 


A. How does the role of register %xmm0 differ in these two loops?


B. Will the more optimized version faithfully implement the C code of com- 
bine3, including when there is memory aliasing between dest and the vec- 
tor data? 


C. Either explain why this optimization preserves the desired behavior, or give 
an example where it would produce different results than the less optimized 
code. 





With this final transformation, we reached a point where we require just 1.25 to 5
clock cycles for each element to be computed. This is a considerable improvement 
over the original 9-11 cycles when we first enabled optimization. We would now 
like to see just what factors are constraining the performance of our code and how 
we can improve things even further. 


5.7 Understanding Modern Processors 


Up to this point, we have applied optimizations that did not rely on any features 
of the target machine. They simply reduced the overhead of procedure calls and 
eliminated some of the critical “optimization blockers” that cause difficulties
for optimizing compilers. As we seek to push the performance further, we must 
consider optimizations that exploit the microarchitecture of the processor—that is, 
the underlying system design by which a processor executes instructions. Getting 
every last bit of performance requires a detailed analysis of the program as well as 
code generation tuned for the target processor. Nonetheless, we can apply some 
basic optimizations that will yield an overall performance improvement on a large 
class of processors. The detailed performance results we report here may not hold 
for other machines, but the general principles of operation and optimization apply 
to a wide variety of machines. 

To understand ways to improve performance, we require a basic understand- 
ing of the microarchitectures of modern processors. Due to the large number of 
transistors that can be integrated onto a single chip, modern microprocessors em- 
ploy complex hardware that attempts to maximize program performance. One 
result is that their actual operation is far different from the view that is perceived 
by looking at machine-level programs. At the code level, it appears as if instruc- 
tions are executed one at a time, where each instruction involves fetching values 
from registers or memory, performing an operation, and storing results back to
a register or memory location. In the actual processor, a number of instructions
are evaluated simultaneously, a phenomenon referred to as instruction-level parallelism.
In some designs, there can be 100 or more instructions “in flight.” Elaborate
mechanisms are employed to make sure the behavior of this parallel execution 
exactly captures the sequential semantic model required by the machine-level 
program. This is one of the remarkable feats of modern microprocessors: they 
employ complex and exotic microarchitectures, in which multiple instructions can 
be executed in parallel, while presenting an operational view of simple sequential 
instruction execution. 

Although the detailed design of a modern microprocessor is well beyond 
the scope of this book, having a general idea of the principles by which they 
operate suffices to understand how they achieve instruction-level parallelism. We 
will find that two different lower bounds characterize the maximum performance 
of a program. The latency bound is encountered when a series of operations 
must be performed in strict sequence, because the result of one operation is 
required before the next one can begin. This bound can limit program performance 
when the data dependencies in the code limit the ability of the processor to 
exploit instruction-level parallelism. The throughput bound characterizes the raw 
computing capacity of the processor's functional units. This bound becomes the 
ultimate limit on program performance.


5.7.1 Overall Operation


Figure 5.11 shows a very simplified view of a modern microprocessor. Our hypothetical
processor design is based loosely on the structure of recent Intel processors.
These processors are described in the industry as being superscalar, which
means they can perform multiple operations on every clock cycle, and out of order,
meaning that the order in which instructions execute need not correspond to their 
ordering in the machine-level program. The overall design has two main parts: 
the instruction control unit (ICU), which is responsible for reading a sequence of 
instructions from memory and generating from these a set of primitive operations 
to perform on program data, and the execution unit (EU), which then executes 
these operations. Compared to the simple in-order pipeline we studied in Chap- 
ter 4, out-of-order processors require far greater and more complex hardware, but 
they are better at achieving higher degrees of instruction-level parallelism. 

The ICU reads the instructions from an instruction cache, a special high-speed
memory containing the most recently accessed instructions. In general,
the ICU fetches well ahead of the currently executing instructions, so that it has
enough time to decode these and send operations down to the EU. One problem,
however, is that when a program hits a branch,¹ there are two possible directions
the program might go. The branch can be taken, with control passing to the branch
target. Alternatively, the branch can be not taken, with control passing to the next
instruction in the instruction sequence. Modern processors employ a technique
known as branch prediction, in which they guess whether or not a branch will be
taken and also predict the target address for the branch. Using a technique known
as speculative execution, the processor begins fetching and decoding instructions
at the location where it predicts the branch will go, and even begins executing these
operations before it has been determined whether or not the branch prediction was correct.
If it later determines that the branch was predicted incorrectly, it resets the state
to that at the branch point and begins fetching and executing instructions in the
other direction. The block labeled “Fetch control” incorporates branch prediction
to perform the task of determining which instructions to fetch.

1. We use the term “branch” specifically to refer to conditional jump instructions. Other instructions
that can transfer control to multiple destinations, such as procedure return and indirect jumps, provide
similar challenges for the processor.

Figure 5.11 Block diagram of an out-of-order processor. The instruction control
unit is responsible for reading instructions from memory and generating a sequence
of primitive operations. The execution unit then performs the operations and indicates
whether the branches were correctly predicted.

The instruction decoding logic takes the actual program instructions and con- 
verts them into a set of primitive operations (sometimes referred to as micro- 
operations). Each of these operations performs some simple computational task 
such as adding two numbers, reading data from memory, or writing data to memory.
For machines with complex instructions, such as x86 processors, an instruction
can be decoded into multiple operations. The details of how instructions are decoded
into sequences of operations vary between machines, and this information
is considered highly proprietary. Fortunately, we can optimize our programs without
knowing the low-level details of a particular machine implementation.

In a typical x86 implementation, an instruction that only operates on registers, 
such as 


addq %rax,%rdx 


is converted into a single operation. On the other hand, an instruction involving 
one or more memory references, such as 


addq %rax,8(%rdx)


yields multiple operations, separating the memory references from the arithmetic
operations. This particular instruction would be decoded as three operations: one
to load a value from memory into the processor, one to add the loaded value to the
value in register %rax, and one to store the result back to memory. The decoding
splits instructions to allow a division of labor among a set of dedicated hardware
units. These units can then execute the different parts of multiple instructions in
parallel.
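
The three-way split can be pictured at the C level, with one statement per primitive operation. This is only an analogy, since the actual decoding is internal to the processor: add_to_memory and the temporary t are hypothetical names, and the sketch assumes a long occupies 8 bytes, as on x86-64, so that rdx[1] names the location 8(%rdx).

```c
/* C-level analogy for addq %rax,8(%rdx), one statement per operation. */
void add_to_memory(long rax, long *rdx)
{
    long t;          /* hypothetical internal temporary, not a program register */

    t = rdx[1];      /* load: read the value at address rdx + 8 */
    t = t + rax;     /* add: combine the loaded value with the value of %rax */
    rdx[1] = t;      /* store: write the result back to the same address */
}
```

Written this way, it is easier to see how the load, the addition, and the store could each be handed to a different hardware unit.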

The EU receives operations from the instruction fetch unit. Typically, it can 
receive a number of them on each clock cycle. These operations are dispatched to 
a set of functional units that perform the actual operations. These functional units 
are specialized to handle different types of operations. 

Reading and writing memory is implemented by the load and store units. The 
load unit handles operations that read data from the memory into the processor. 
This unit has an adder to perform address computations. Similarly, the store unit 
handles operations that write data from the processor to the memory. It also has 
an adder to perform address computations. As shown in the figure, the load and 
store units access memory via a data cache, a high-speed memory containing the 
most recently accessed data values.

With speculative execution, the operations are evaluated, but the final results
are not stored in the program registers or data memory until the processor can
be certain that these instructions should actually have been executed. Branch
operations are sent to the EU, not to determine where the branch should go, but
rather to determine whether or not they were predicted correctly. If the prediction
was incorrect, the EU will discard the results that have been computed beyond the
branch point. It will also signal the branch unit that the prediction was incorrect
and indicate the correct branch destination. In this case, the branch unit begins
fetching at the new location. As we saw in Section 3.6.6, such a misprediction incurs
a significant cost in performance. It takes a while before the new instructions can
be fetched, decoded, and sent to the functional units.

Figure 5.11 indicates that the different functional units are designed to perform
different operations. Those labeled as performing “arithmetic operations”
are typically specialized to perform different combinations of integer and
floating-point operations. As the number of transistors that can be integrated onto a single
microprocessor chip has grown over time, successive models of microprocessors
have increased the total number of functional units, the combinations of operations
each unit can perform, and the performance of each of these units. The arithmetic
units are intentionally designed to be able to perform a variety of different
operations, since the required operations vary widely across different programs. 
For example, some programs might involve many integer operations, while others 
require many floating-point operations. If one functional unit were specialized to 
perform integer operations while another could only perform floating-point oper- 
ations, then none of these programs would get the full benefit of having multiple 
functional units. 

For example, our Intel Core i7 Haswell reference machine has eight functional 
units, numbered 0-7. Here is a partial list of each one's capabilities: 


0. Integer arithmetic, floating-point multiplication, integer and floating-point
   division, branches
1. Integer arithmetic, floating-point addition, integer multiplication,
   floating-point multiplication
2. Load, address computation
3. Load, address computation
4. Store
5. Integer arithmetic
6. Integer arithmetic, branches
7. Store address computation


In the above list, "integer arithmetic" refers to basic operations, such as addition, 
bitwise operations, and shifting. Multiplication and division require more specialized
resources. We see that a store operation requires two functional units: one
to compute the store address and one to actually store the data. We will discuss
the mechanics of store (and load) operations in Section 5.12. 

We can see that this combination of functional units has the potential to 
perform multiple operations of the same type simultaneously. It has four units 
capable of performing integer operations, two that can perform load operations, 
and two that can perform floating-point multiplication. We will later see the impact
these resources have on the maximum performance our programs can achieve. 

Within the ICU, the retirement unit keeps track of the ongoing processing and 
makes sure that it obeys the sequential semantics of the machine-level program. 
Our figure shows a register file containing the integer, floating-point, and, more 
recently, SSE and AVX registers as part of the retirement unit, because this unit
controls the updating of these registers. As an instruction is decoded, information
about it is placed into a first-in, first-out queue. This information remains in 
the queue until one of two outcomes occurs. First, once the operations for the 
instruction have completed and any branch points leading to this instruction are 
confirmed as having been correctly predicted, the instruction can be retired, with 
any updates to the program registers being made. If some branch point leading 
to this instruction was mispredicted, on the other hand, the instruction will be
flushed, discarding any results that may have been computed. By this means,
mispredictions will not alter the program state.

Aside  The history of out-of-order processing

Out-of-order processing was first implemented in the Control Data Corporation 6600 processor, in
1964. Instructions were processed by 10 different functional units, each of which could be operated
independently. In its day, this machine, with a clock rate of 10 MHz, was considered the premium
machine for scientific computing.

IBM first implemented out-of-order processing with the IBM 360/91 processor in 1966, but just to
execute the floating-point instructions. For around 25 years, out-of-order processing was considered
an exotic technology, found only in machines striving for the highest possible performance, until
IBM reintroduced it in the RS/6000 line of workstations in 1990. This design became the basis for
the IBM/Motorola PowerPC line, with the model 601, introduced in 1993, becoming the first single-chip
microprocessor to use out-of-order processing. Intel introduced out-of-order processing with its
PentiumPro model in 1995, with an underlying microarchitecture similar to that of our reference
machine.

As we have described, any updates to the program registers occur only as 
instructions are being retired, and this takes place only after the processor can be 
certain that any branches leading to this instruction have been correctly predicted. 
To expedite the communication of results from one instruction to another, much 
of this information is exchanged among the execution units, shown in the figure as 
“Operation results.” As the arrows in the figure show, the execution units can send 
results directly to each other. This is a more elaborate form of the data-forwarding 
techniques we incorporated into our simple processor design in Section 4.5.5. 

The most common mechanism for controlling the communication of operands 
among the execution units is called register renaming. When an instruction that 
updates register r is decoded, a tag t is generated giving a unique identifier to 
the result of the operation. An entry (r, t) is added to a table maintaining the 
association between program register r and tag t for an operation that will update 
this register. When a subsequent instruction using register r as an operand is 
decoded, the operation sent to the execution unit will contain t as the source
for the operand value. When some execution unit completes the first operation,
it generates a result (v, t), indicating that the operation with tag t produced
value v. Any operation waiting for t as a source will then use v as the source
value, a form of data forwarding. By this mechanism, values can be forwarded 
directly from one operation to another, rather than being written to and read from 
the register file, enabling the second operation to begin as soon as the first has 
completed. The renaming table only contains entries for registers having pending 
write operations. When a decoded instruction requires a register r, and there is no 
tag associated with this register, the operand is retrieved directly from the register 
file. With register renaming, an entire sequence of operations can be performed 
speculatively, even though the registers are updated only after the processor is
certain of the branch outcomes.

                     Integer                        Floating point
Operation        Latency  Issue  Capacity       Latency  Issue  Capacity
Addition            1       1       4              3       1       1
Multiplication      3       1       1              5       1       2
Division          3-30    3-30      1            3-15    3-15      1

Figure 5.12 Latency, issue time, and capacity characteristics of reference machine
operations. Latency indicates the total number of clock cycles required to perform the
actual operations, while issue time indicates the minimum number of cycles between
two independent operations. The capacity indicates how many of these operations can
be issued simultaneously. The times for division depend on the data values.

5.7.2 Functional Unit Performance


Figure 5.12 documents the performance of some of the arithmetic operations for 
our Intel Core i7 Haswell reference machine, determined by both measurements 
and by reference to Intel literature [49]. These timings are typical for other proces- 
sors as well. Each operation is characterized by its latency, meaning the total time 
required to perform the operation, the issue time, meaning the minimum num- 
ber of clock cycles between two independent operations of the same type, and 
the capacity, indicating the number of functional units capable of performing that 
operation. 

We see that the latencies increase in going from integer to floating-point 
operations. We see also that the addition and multiplication operations all have 
issue times of 1, meaning that on each clock cycle, the processor can start a 
new one of these operations. This short issue time is achieved through the use 
of pipelining. A pipelined functional unit is implemented as a series of stages,
each of which performs part of the operation. For example, a typical floating- 
point adder contains three stages (and hence the three-cycle latency): one to 
process the exponent values, one to add the fractions, and one to round the result. 
The arithmetic operations can proceed through the stages in close succession 
rather than waiting for one operation to complete before the next begins. This 
capability can be exploited only if there are successive, logically independent 
operations to be performed. Functional units with issue times of 1 cycle are said
to be fully pipelined: they can start a new operation every clock cycle. Operations 
with capacity greater than 1 arise due to the capabilities of the multiple functional 
units, as was described earlier for the reference machine. 
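
The relationship between latency and issue time can be made concrete with a small calculation. The following is a minimal sketch, not code from the reference machine's documentation: the function name cycles_for_independent_ops and its simple timing model (one unit, a new operation accepted every `issue` cycles, each result ready `latency` cycles after its operation starts) are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of clock cycles for one functional unit to
 * complete n logically independent operations, assuming it can start a
 * new operation every `issue` cycles and each operation needs `latency`
 * cycles from start to result.
 */
long cycles_for_independent_ops(long n, long latency, long issue)
{
    assert(latency > 0 && issue > 0);
    if (n <= 0)
        return 0;
    /* The last operation starts at cycle (n - 1) * issue and finishes
       `latency` cycles after that. */
    return (n - 1) * issue + latency;
}
```

Under this model, 100 independent floating-point additions (latency 3, issue time 1 in Figure 5.12) take 102 cycles, whereas a unit with the same latency but no pipelining (issue time 3) would take 300.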

We see also that the divider (used for integer and floating-point division, as
well as floating-point square root) is not pipelined: its issue time equals its latency.
What this means is that the divider must perform a complete division before it can
begin a new one. We also see that the latencies and issue times for division are given
as ranges, because some combinations of dividend and divisor require more steps
than others. The long latency and issue times of division make it a comparatively
costly operation.

A more common way of expressing issue time is to specify the maximum
throughput of the unit, defined as the reciprocal of the issue time. A fully pipelined
functional unit has a maximum throughput of 1 operation per clock cycle, while
units with higher issue times have lower maximum throughput. Having multiple
functional units can increase throughput even further. For an operation with
capacity C and issue time I, the processor can potentially achieve a throughput of
C/I operations per clock cycle. For example, our reference machine is capable of
performing floating-point multiplication operations at a rate of 2 per clock cycle. 
We will see how this capability can be exploited to increase program performance. 
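
The C/I expression above is simple enough to check by hand, but writing it out makes the units explicit. This is an illustrative sketch only: the helper names max_throughput and min_cycles_per_op are hypothetical, and the inputs are meant to be read off Figure 5.12.

```c
#include <assert.h>

/*
 * Hypothetical helper: peak operations per clock cycle for an operation
 * that `capacity` functional units can perform, when each unit accepts a
 * new operation every `issue` cycles (the C/I expression in the text).
 */
double max_throughput(int capacity, int issue)
{
    assert(capacity > 0 && issue > 0);
    return (double) capacity / issue;
}

/* The reciprocal view: the fewest cycles per operation the units allow. */
double min_cycles_per_op(int capacity, int issue)
{
    return 1.0 / max_throughput(capacity, issue);
}
```

For floating-point multiplication on the reference machine (capacity 2, issue time 1), max_throughput gives 2.0 operations per cycle, which corresponds to 0.5 cycles per operation.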

Circuit designers can create functional units with wide ranges of performance 
characteristics. Creating a unit with short latency or with pipelining requires 
more hardware, especially for more complex functions such as multiplication 
and floating-point operations. Since there is only a limited amount of space for 
these units on the microprocessor chip, CPU designers must carefully balance 
the number of functional units and their individual performance to achieve op- 
timal overall performance. They evaluate many different benchmark programs 
and dedicate the most resources to the most critical operations. As Figure 5.12 
indicates, integer multiplication and floating-point multiplication and addition 
were considered important operations in the design of the Core i7 Haswell pro- 
cessor, even though a significant amount of hardware is required to achieve the 
low latencies and high degree of pipelining shown. On the other hand, division 
is relatively infrequent and difficult to implement with either short latency or full 
pipelining. 

The latencies, issue times, and capacities of these arithmetic operations can 
affect the performance of our combining functions. We can express these effects 
in terms of two fundamental bounds on the CPE values:

                 Integer          Floating point
Bound            +       *        +       *
Latency          1.00    3.00     3.00    5.00
Throughput       0.50    1.00     1.00    0.50

The latency bound gives a minimum value for the CPE for any function that must
perform the combining operation in a strict sequence. The throughput bound
gives a minimum bound for the CPE based on the maximum rate at which the
functional units can produce results. For example, since there is only one integer
multiplier, and it has an issue time of 1 clock cycle, the processor cannot possibly
sustain a rate of more than 1 multiplication per clock cycle. On the other hand,
with four functional units capable of performing integer addition, the processor
can potentially sustain a rate of 4 operations per cycle. Unfortunately, the need
to read elements from memory creates an additional throughput bound. The
two load units limit the processor to reading at most 2 data values per clock
cycle, yielding a throughput bound of 0.50. We will demonstrate the effect of
both the latency and throughput bounds with different versions of the combining
functions.

5.7.3 An Abstract Model of Processor Operation


As a tool for analyzing the performance of a machine-level program executing on a
modern processor, we will use a data-flow representation of programs, a graphical 
notation showing how the data dependencies between the different operations 
constrain the order in which they are executed. These constraints then lead to 
critical paths in the graph, putting a lower bound on the number of clock cycles 
required to execute a set of machine instructions. 

Before proceeding with the technical details, it is instructive to examine the 
CPE measurements obtained for function combine4, our fastest code up to this 
point: 


                                              Integer          Floating point
Function    Page   Method                     +       *        +       *
combine4    515    Accumulate in temporary    1.27    3.01     3.01    5.01
Latency bound                                 1.00    3.00     3.00    5.00
Throughput bound                              0.50    1.00     1.00    0.50


We can see that these measurements match the latency bound for the processor,
except for the case of integer addition. This is not a coincidence: it indicates
that the performance of these functions is dictated by the latency of the sum
or product computation being performed. Computing the product or sum of n
elements requires around L · n + K clock cycles, where L is the latency of the
combining operation and K represents the overhead of calling the function and
initiating and terminating the loop. The CPE is therefore equal to the latency
bound L.
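
To see why the overhead term K drops out of the CPE, it helps to treat the CPE as the slope of cycle count against n. The sketch below is an illustration under stated assumptions: model_cycles encodes only the L · n + K approximation from the text, cpe_from_two_runs is a hypothetical two-point slope estimate rather than a full measurement procedure, and any value chosen for K is arbitrary.

```c
#include <assert.h>

/* Modeled cycle count for combining n elements: L * n + K. */
double model_cycles(double L, double K, long n)
{
    return L * (double) n + K;
}

/*
 * Hypothetical helper: estimate cycles per element as the slope between
 * two (length, cycles) observations. A constant overhead K appears in
 * both cycle counts and cancels in the difference.
 */
double cpe_from_two_runs(double cycles1, long n1, double cycles2, long n2)
{
    assert(n1 != n2);
    return (cycles2 - cycles1) / (double) (n2 - n1);
}
```

With L = 5 and an assumed overhead of K = 20, the modeled counts for n = 100 and n = 200 are 520 and 1020 cycles, and the slope between them is exactly 5.0 regardless of the value picked for K.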


From Machine-Level Code to Data-Flow Graphs 


Our data-flow representation of programs is informal. We use it as a way to 
visualize how the data dependencies in a program dictate its performance. We 
present the data-flow notation by working with combine4 (Figure 5.10) as an
example. We focus just on the computation performed by the loop, since this is the
dominating factor in performance for large vectors. We consider the case of data
type double with multiplication as the combining operation. Other combinations
of data type and operation yield similar code. The compiled code for this loop
consists of four instructions, with registers %rdx holding a pointer to the ith
element of array data, %rax holding a pointer to the end of the array, and %xmm0
holding the accumulated value acc.


Inner loop of combine4. data_t = double, OP = *
acc in %xmm0, data+i in %rdx, data+length in %rax
.L25:                                   loop:
  vmulsd  (%rdx), %xmm0, %xmm0            Multiply acc by data[i]
  addq    $8, %rdx                        Increment data+i
  cmpq    %rax, %rdx                      Compare to data+length
  jne     .L25                            If !=, goto loop








Figure 5.13 Graphical representation of inner-loop code for combine4. Instructions
are dynamically translated into one or two operations, each of which receives values
from other operations or from registers and produces values for other operations and for
registers. We show the target of the final instruction as the label loop. It jumps to the
first instruction shown.


As Figure 5.13 indicates, with our hypothetical processor design, the four instructions
are expanded by the instruction decoder into a series of five operations,
with the initial multiplication instruction being expanded into a load operation
to read the source operand from memory, and a mul operation to perform the
multiplication.

As a step toward generating a data-flow graph representation of the program,
the boxes and lines along the left-hand side of Figure 5.13 show how the registers
are used and updated by the different operations, with the boxes along the top
representing the register values at the beginning of the loop, and those along the
bottom representing the values at the end. For example, register %rax is only used
as a source value by the cmp operation, and so the register has the same value at
the end of the loop as at the beginning. Register %rdx, on the other hand, is both
used and updated within the loop. Its initial value is used by the load and add
operations; its new value is generated by the add operation, which is then used
by the cmp operation. Register %xmm0 is also updated within the loop by the mul
operation, which first uses the initial value as a source value.

Some of the operations in Figure 5.13 produce values that do not correspond 
to registers. We show these as arcs between operations on the right-hand side. 
The load operation reads a value from memory and passes it directly to the 
mul operation. Since these two operations arise from decoding a single vmulsd
instruction, there is no register associated with the intermediate value passing 
between them. The cmp operation updates the condition codes, and these are 
then tested by the jne operation. 

Figure 5.14 Abstracting combine4 operations as a data-flow graph. We rearrange the
operators of Figure 5.13 to more clearly show the data dependencies (a), and
then further show only those operations that use values from one iteration
to produce new values for the next (b).

For a code segment forming a loop, we can classify the registers that are
accessed into four categories:

Read-only. These are used as source values, either as data or to compute memory
    addresses, but they are not modified within the loop. The only read-only
    register for the loop in combine4 is %rax.

Write-only. These are used as the destinations of data-movement operations.
    There are no such registers in this loop.

Local. These are updated and used within the loop, but there is no dependency
    from one iteration to another. The condition code registers are examples
    for this loop: they are updated by the cmp operation and used by the jne
    operation, but this dependency is contained within individual iterations.

Loop. These are used both as source values and as destinations for the loop,
    with the value generated in one iteration being used in another. We can
    see that %rdx and %xmm0 are loop registers for combine4, corresponding
    to program values data+i and acc.


As we will see, the chains of operations between loop registers determine the 
performance-limiting data dependencies. 
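
The register categories can be traced back to source level. The function below is a simplified stand-in for the inner loop of combine4 with data type double and multiplication as the combining operation, not the book's actual code: it takes a plain array rather than a vec_ptr, the name product_of is hypothetical, and the register names in the comments assume the compiled code shown earlier.

```c
#include <assert.h>

/*
 * Hypothetical, simplified version of combine4's loop for
 * data_t = double and OP = *, written with pointers so that each C
 * variable lines up with one register of the compiled loop.
 */
double product_of(const double *data, long length)
{
    assert(length >= 0);
    const double *end = data + length; /* read-only: compared but never modified (%rax) */
    const double *p = data;            /* loop: read and updated every iteration (%rdx) */
    double acc = 1.0;                  /* loop: read and updated every iteration (%xmm0) */

    while (p != end) {   /* local: the comparison outcome is consumed within the iteration */
        acc = acc * *p;  /* load feeds mul; the loaded value never occupies a named register */
        p++;             /* add produces the pointer value used by the next iteration */
    }
    return acc;
}
```

Only acc and p carry values from one iteration to the next, which is why the chains through them are the ones that matter for the analysis that follows.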

Figure 5.14 shows further refinements of the graphical representation of Fig- 
ure 5.13, with a goal of showing only those operations and data dependencies that 
affect the program execution time. We see in Figure 5.14(a) that we rearranged 
the operators to show more clearly the flow of data from the source registers at 
the top (both read-only and loop registers) and to the destination registers at the 
bottom (both write-only and loop registers). 

In Figure 5.14(a), we also color operators white if they are not part of some 
chain of dependencies between loop registers. For this example, the comparison 
(cmp) and branch (jne) operations do not directly affect the flow of data in the 
program. We assume that the instruction control unit predicts that the branch will be
taken, and hence the program will continue looping. The purpose of the compare
and branch operations is to test the branch condition and notify the ICU if it is
not taken. We assume this checking can be done quickly enough that it does not
slow down the processor.

In Figure 5.14(b), we have eliminated the operators that were colored white 
on the left, and we have retained only the loop registers. What we have left is an
abstract template showing the data dependencies that form among loop registers
due to one iteration of the loop. We can see in this diagram that there are two 
data dependencies from one iteration to the next. Along one side, we see the 
dependencies between successive values of program value acc, stored in register 
%xmm0. The loop computes a new value for acc by multiplying the old value by a 
data element, generated by the load operation. Along the other side, we see the 
dependencies between successive values of the pointer to the ith data element. 
On each iteration, the old value is used as the address for the load operation, and 
it is also incremented by the add operation to compute its new value. 

Figure 5.15 Data-flow representation of computation by n iterations of the inner
loop of combine4. The sequence of multiplication operations forms a critical
path that limits program performance.

Figure 5.15 shows the data-flow representation of n iterations by the inner loop
of function combine4. This graph was obtained by simply replicating the template
shown in Figure 5.14(b) n times. We can see that the program has two chains of data
dependencies, corresponding to the updating of program values acc and data+i
with operations mul and add, respectively. Given that floating-point multiplication
has a latency of 5 cycles, while integer addition has a latency of 1 cycle, we can see
that the chain on the left will form a critical path, requiring 5n cycles to execute.
The chain on the right would require only n cycles to execute, and so it does not
limit the program performance.

Figure 5.15 demonstrates why we achieved a CPE equal to the latency bound 
of 5 cycles for combine4, when performing floating-point multiplication. When ex- 
ecuting the function, the floating-point multiplier becomes the limiting resource. 
The other operations required during the loop (manipulating and testing pointer
value data+i and reading data from memory) proceed in parallel with the multiplication.
As each successive value of acc is computed, it is fed back around to
compute the next value, but this will not occur until 5 cycles later.

The flow for other combinations of data type and operation are identical to 
those shown in Figure 5.15, but with a different data operation forming the chain of 
data dependencies shown on the left. For all of the cases where the operation has 
a latency L greater than 1, we see that the measured CPE is simply L, indicating 
that this chain forms the performance-limiting critical path. 
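
The critical-path argument amounts to taking the longer of the two loop-carried chains. As a rough sketch, assuming exactly two chains as in Figure 5.15 (one through acc and one through the pointer), the hypothetical helper below computes the resulting lower bound on cycles; it deliberately ignores the other limiting factors discussed in the next section.

```c
#include <assert.h>

/*
 * Hypothetical helper: lower bound on the cycles for n iterations of a
 * loop with two independent loop-carried chains. Each chain performs one
 * operation per iteration, so its length is latency * n, and the longer
 * chain is the critical path.
 */
long critical_path_cycles(long n, long acc_latency, long ptr_latency)
{
    assert(n >= 0 && acc_latency > 0 && ptr_latency > 0);
    long acc_chain = acc_latency * n;  /* chain through acc (mul in the example) */
    long ptr_chain = ptr_latency * n;  /* chain through data+i (add) */
    return acc_chain > ptr_chain ? acc_chain : ptr_chain;
}
```

For floating-point multiplication (latency 5) against the 1-cycle pointer increment, 100 iterations give a bound of 500 cycles, matching the 5n figure in the text.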


Other Performance Factors 


For the case of integer addition, on the other hand, our measurements of combine4 
show a CPE of 1.27, slower than the CPE of 1.00 we would predict based on the 
chains of dependencies formed along either the left- or the right-hand side of the 
graph of Figure 5.15. This illustrates the principle that the critical paths in a data- 
flow representation provide only a lower bound on how many cycles a program 
will require. Other factors can also limit performance, including the total number 
of functional units available and the number of data values that can be passed 
among the functional units on any given step. For the case of integer addition as 
the combining operation, the data operation is sufficiently fast that the rest of the 
operations cannot supply data fast enough. Determining exactly why the program 
requires 1.27 cycles per element would require a much more detailed knowledge 
of the hardware design than is publicly available. 

To summarize our performance analysis of combine4: our abstract data-flow 
representation of program operation showed that combine4 has a critical path of 
length L · n caused by the successive updating of program value acc, and this path
limits the CPE to at least L. This is indeed the CPE we measure for all cases except 
integer addition, which has a measured CPE of 1.27 rather than the CPE of 1.00 
we would expect from the critical path length. 

It may seem that the latency bound forms a fundamental limit on how fast 
our combining operations can be performed. Our next task will be to restructure
the operations to enhance instruction-level parallelism. We want to transform the 
program in such a way that our only limitation becomes the throughput bound, 
yielding CPEs below or close to 1.00.

Practice Problem 5.5 (solution page 575)
Suppose we wish to write a function to evaluate a polynomial, where a polynomial
of degree n is defined to have a set of coefficients $a_0, a_1, a_2, \ldots, a_n$. For a value x,
we evaluate the polynomial by computing

\[ a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \tag{5.2} \]

This evaluation can be implemented by the following function, having as arguments
an array of coefficients a, a value x, and the polynomial degree degree (the
value n in Equation 5.2). In this function, we compute both the successive terms
of the equation and the successive powers of x within a single loop:

double poly(double a[], double x, long degree)
{
    long i;
    double result = a[0];
    double xpwr = x; /* Equals x^i at start of loop */
    for (i = 1; i <= degree; i++) {
        result += a[i] * xpwr;
        xpwr = x * xpwr;
    }
    return result;
}


A. For degree n, how many additions and how many multiplications does this
   code perform?


B. On our reference machine, with arithmetic operations having the latencies
   shown in Figure 5.12, we measure the CPE for this function to be 5.00. Explain
   how this CPE arises based on the data dependencies formed between
   iterations due to the operations implementing lines 7-8 of the function.


Practice Problem 5.6
Let us continue exploring ways to evaluate polynomials, as described in Practice
Problem 5.5. We can reduce the number of multiplications in evaluating a polynomial
by applying Horner's method, named after British mathematician William G.
Horner (1786-1837). The idea is to repeatedly factor out the powers of x to get
the following evaluation:

\[ a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + x a_n) \cdots)) \tag{5.3} \]

Using Horner's method, we can implement polynomial evaluation using the
following code:

1 /* Apply Horner's method */
2 double polyh(double a[], double x, long degree)
3 {
4     long i;
5     double result = a[degree];
6     for (i = degree-1; i >= 0; i--)
7         result = a[i] + x*result;
8     return result;
9 }


A. For degree n, how many additions and how many multiplications does this 
code perform? 


B. On our reference machine, with the arithmetic operations having the laten- 
cies shown in Figure 5.12, we measure the CPE for this function to be 8.00. 
Explain how this CPE arises based on the data dependencies formed between
iterations due to the operations implementing line 7 of the function.


C. Explain how the function shown in Practice Problem 5.5 can run faster, even 
though it requires more operations. 


5.8 Loop Unrolling 


Loop unrolling is a program transformation that reduces the number of iterations 
for a loop by increasing the number of elements computed on each iteration. We 
saw an example of this with the function psum2 (Figure 5.1), where each iteration 
computes two elements of the prefix sum, thereby halving the total number of 
iterations required. Loop unrolling can improve performance in two ways. First, 
it reduces the number of operations that do not contribute directly to the program 
result, such as loop indexing and conditional branching. Second, it exposes ways 
in which we can further transform the code to reduce the number of operations 
in the critical paths of the overall computation. In this section, we will examine 
simple loop unrolling, without any further transformations. 

Figure 5.16 shows a version of our combining code using what we will refer
to as "2 x 1 loop unrolling." The first loop steps through the array two elements
at a time. That is, the loop index i is incremented by 2 on each iteration, and the
combining operation is applied to array elements i and i + 1 in a single iteration.

In general, the vector length will not be a multiple of 2. We want our code 
to work correctly for arbitrary vector lengths. We account for this requirement in 
two ways. First, we make sure the first loop does not overrun the array bounds. 
For a vector of length n, we set the loop limit to be n - 1. We are then assured that
the loop will only be executed when the loop index i satisfies i < n - 1, and hence
the maximum array index i + 1 will satisfy i + 1 < (n - 1) + 1 = n.

We can generalize this idea to unroll a loop by any factor k, yielding k x 1
loop unrolling. To do so, we set the upper limit to be n - k + 1 and within the
loop apply the combining operation to elements i through i + k - 1. Loop index i
is incremented by k in each iteration. The maximum array index i + k - 1 will
then be less than n. We include the second loop to step through the final few
elements of the vector one at a time. The body of this loop will be executed
between 0 and k - 1 times. For k = 2, we could use a simple conditional statement
to optionally add a final iteration, as we did with the function psum2 (Figure 5.1).
For k > 2, the finishing cases are better expressed with a loop, and so we adopt
this programming convention for k = 2 as well. We refer to this transformation as
"k x 1 loop unrolling," since we unroll by a factor of k but accumulate values in a
single variable acc.

/* 2 x 1 loop unrolling */
void combine5(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length-1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        acc = (acc OP data[i]) OP data[i+1];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

Figure 5.16 Applying 2 x 1 loop unrolling. This transformation can reduce the effect
of loop overhead.

Practice Problem 5.7
Modify the code for combine5 to unroll the loop by a factor k = 5.

When we measure the performance of unrolled code for unrolling factors
k = 2 (combine5) and k = 3, we get the following results:

                                        Integer          Floating point
Function    Page   Method               +       *        +       *
combine4    515    No unrolling         1.27    3.01     3.01    5.01
combine5    532    2 x 1 unrolling      1.01    3.01     3.01    5.01
                   3 x 1 unrolling      1.01    3.01     3.01    5.01
Latency bound                           1.00    3.00     3.00    5.00
Throughput bound                        0.50    1.00     1.00    0.50

Figure 5.17 CPE performance for different degrees of k x 1 loop unrolling. Only
integer addition improves with this transformation. (Plot of CPE against
unrolling factor k.)


We see that the CPE for integer addition improves, achieving the latency 
bound of 1.00. This result can be attributed to the benefits of reducing loop 
overhead operations. By reducing the number of overhead operations relative
to the number of additions required to compute the vector sum, we can reach 
the point where the 1-cycle latency of integer addition becomes the performance- 
limiting factor. On the other hand, none of the other cases improve—they are 
already at their latency bounds. Figure 5.17 shows CPE measurements when 
unrolling the loop by up to a factor of 10. We see that the trends we observed 
for unrolling by 2 and 3 continue—none go below their latency bounds. 
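
The general k x 1 scheme described earlier can be sketched for one concrete case. The function below is an illustration with k = 3 and integer addition, written against a plain array rather than the vec_ptr abstraction; the name sum3x1 is hypothetical, and the code is a sketch of the technique rather than the version that was measured.

```c
#include <assert.h>

/*
 * Hypothetical 3 x 1 unrolling of an integer sum: three elements are
 * combined per iteration, all accumulated into the single variable acc.
 */
long sum3x1(const long *data, long length)
{
    assert(length >= 0);
    long i;
    long limit = length - 2;  /* n - k + 1 with k = 3 */
    long acc = 0;

    /* Combine 3 elements at a time; i + 2 < length holds whenever i < limit */
    for (i = 0; i < limit; i += 3) {
        acc = ((acc + data[i]) + data[i+1]) + data[i+2];
    }

    /* Finish any remaining elements (between 0 and 2 of them) */
    for (; i < length; i++) {
        acc = acc + data[i];
    }
    return acc;
}
```

Because every addition still feeds the next one through acc, this form reduces loop overhead but keeps the same chain of dependent operations.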

To understand why k x 1 unrolling cannot improve performance beyond
the latency bound, let us examine the machine-level code for the inner loop of
combine5, having k = 2. The following code gets generated when type data_t is
double, and the operation is multiplication:


Inner loop of combine5. data_t = double, OP = *
i in %rdx, data in %rax, limit in %rbp, acc in %xmm0
.L35:                                         loop:
  vmulsd  (%rax,%rdx,8), %xmm0, %xmm0           Multiply acc by data[i]
  vmulsd  8(%rax,%rdx,8), %xmm0, %xmm0          Multiply acc by data[i+1]
  addq    $2, %rdx                              Increment i by 2
  cmpq    %rdx, %rbp                            Compare to limit:i
  jg      .L35                                  If >, goto loop


We can see that gcc uses a more direct translation of the array referencing
seen in the C code, compared to the pointer-based code generated for combine4.²
Loop index i is held in register %rdx, and the address of data is held in register
%rax. As before, the accumulated value acc is held in vector register %xmm0. The
loop unrolling leads to two vmulsd instructions: one to multiply acc by data[i], and
the second to multiply acc by data[i+1]. Figure 5.18 shows a graphical representation
of this code. The vmulsd instructions each get translated into two operations:
one to load an array element from memory and one to multiply this value by
the accumulated value. We see here that register %xmm0 gets read and written
twice in each execution of the loop. We can rearrange, simplify, and abstract
this graph, following the process shown in Figure 5.19(a), to obtain the template
shown in Figure 5.19(b). We then replicate this template n/2 times to show the
computation for a vector of length n, obtaining the data-flow representation
shown in Figure 5.20. We see here that there is still a critical path of n mul
operations in this graph: there are half as many iterations, but each iteration has
two multiplication operations in sequence. Since the critical path was the limiting
factor for the performance of the code without loop unrolling, it remains so with
k x 1 loop unrolling.


2. The gcc optimizer operates by generating multiple variants of a function and then choosing one that
it predicts will yield the best performance and smallest code size. As a consequence, small changes in
the source code can yield widely varying forms of machine code. We have found that the choice of
pointer-based or array-based code has no impact on the performance of programs running on our
reference machine.


Figure 5.18 Graphical representation of inner-loop code for combine5. Each
iteration has two vmulsd instructions, each of which is translated into a load and a
mul operation.


Figure 5.19 Abstracting combine5 operations as a data-flow graph. We rearrange,
simplify, and abstract the representation of Figure 5.18 to show the data
dependencies between successive iterations (a). We see that each iteration must
perform two multiplications in sequence (b).


Figure 5.20 Data-flow representation of combine5 operating on a vector of length
n. Even though the loop has been unrolled by a factor of 2, there are still n
mul operations along the critical path.


Aside Getting the compiler to unroll loops

Loop unrolling can easily be performed by a compiler. Many compilers do this as part of their collection
of optimizations. gcc will perform some forms of loop unrolling when invoked with optimization level 3
or higher.


5.9 Enhancing Parallelism


At this point, our functions have hit the bounds imposed by the latencies of the 
arithmetic units. As we have noted, however, the functional units performing ad- 
dition and multiplication are all fully pipelined, meaning that they can start new 
operations every clock cycle, and some of the operations can be performed by 
multiple functional units. The hardware has the potential to perform multiplica- 
tions and additions at a much higher rate, but our code cannot take advantage of 
this capability, even with loop unrolling, since we are accumulating the value as a 
single variable acc. We cannot compute a new value for acc until the preceding 
computation has completed. Even though the functional unit computing a new 
value for acc can start a new operation every clock cycle, it will only start one 
every L cycles, where L is the latency of the combining operation. We will now 
investigate ways to break this sequential dependency and get performance better 
than the latency bound. 


5.9.1 Multiple Accumulators 


For a combining operation that is associative and commutative, such as integer
addition or multiplication, we can improve performance by splitting the set of
combining operations into two or more parts and combining the results at the
end. For example, let P_n denote the product of elements a_0, a_1, ..., a_{n-1}:

$$P_n = \prod_{i=0}^{n-1} a_i$$

Assuming n is even, we can also write this as P_n = PE_n x PO_n, where PE_n is the
product of the elements with even indices, and PO_n is the product of the elements
with odd indices:

$$PE_n = \prod_{i=0}^{n/2-1} a_{2i}$$

$$PO_n = \prod_{i=0}^{n/2-1} a_{2i+1}$$

Figure 5.21 shows code that uses this method. It uses both two-way loop
unrolling, to combine more elements per iteration, and two-way parallelism,
accumulating elements with even indices in variable acc0 and elements with odd
indices in variable acc1. We therefore refer to this as "2 x 2 loop unrolling." As
before, we include a second loop to accumulate any remaining array elements for
the case where the vector length is not a multiple of 2. We then apply the combining
operation to acc0 and acc1 to compute the final result.

/* 2 x 2 loop unrolling */
void combine6(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length-1;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i+1];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 OP data[i];
    }
    *dest = acc0 OP acc1;
}


Figure 5.21 Applying 2 x 2 loop unrolling. By maintaining multiple accumulators,
this approach can make better use of the multiple functional units and their pipelining
capabilities.


Comparing loop unrolling alone to loop unrolling with two-way parallelism,
we obtain the following performance:

                                             Integer        Floating point
Function   Page  Method                     +      *       +      *
combine4   515   Accumulate in temporary    1.27   3.01    3.01   5.01
combine5   532   2 x 1 unrolling            1.01   3.01    3.01   5.01
combine6   537   2 x 2 unrolling            0.81   1.51    1.51   2.51
Latency bound                               1.00   3.00    3.00   5.00
Throughput bound                            0.50   1.00    1.00   0.50


We see that we have improved the performance for all cases, with integer
product, floating-point addition, and floating-point multiplication improving by
a factor of around 2, and integer addition improving somewhat as well. Most
significantly, we have broken through the barrier imposed by the latency bound.
The processor no longer needs to delay the start of one sum or product operation
until the previous one has completed.

To understand the performance of combine6, we start with the code and
operation sequence shown in Figure 5.22. We can derive a template showing the
data dependencies between iterations through the process shown in Figure 5.23.
As with combine5, the inner loop contains two vmulsd operations, but these
instructions translate into mul operations that read and write separate registers,
with no data dependency between them (Figure 5.23(b)). We then replicate this
template n/2 times (Figure 5.24), modeling the execution of the function on a
vector of length n. We see that we now have two critical paths, one corresponding
to computing the product of even-numbered elements (program value acc0) and
one for the odd-numbered elements (program value acc1). Each of these critical
paths contains only n/2 operations, thus leading to a CPE of around 5.00/2 = 2.50.
A similar analysis explains our observed CPE of around L/2 for operations with
latency L for the different combinations of data type and combining operation.
Operationally, the programs are exploiting the capabilities of the functional units
to increase their utilization by a factor of 2. The only exception is for integer
addition. We have reduced the CPE to below 1.0, but there is still too much loop
overhead to achieve the theoretical limit of 0.50.


  vmulsd (%rax,%rdx,8), %xmm0, %xmm0
  vmulsd 8(%rax,%rdx,8), %xmm1, %xmm1
  addq $2,%rdx
  cmpq %rdx,%rbp
  jg loop

Figure 5.22 Graphical representation of inner-loop code for combine6. Each iteration has two vmulsd
instructions, each of which is translated into a load and a mul operation.


Figure 5.23 Abstracting combine6 operations as a data-flow graph. We rearrange, simplify, and abstract
the representation of Figure 5.22 to show the data dependencies between successive iterations (a). We see
that there is no dependency between the two mul operations (b).


Figure 5.24 Data-flow representation of combine6 operating on a vector of length n.
We now have two critical paths, each containing n/2 operations.

We can generalize the multiple accumulator transformation to unroll the loop
by a factor of k and accumulate k values in parallel, yielding k x k loop unrolling.
Figure 5.25 demonstrates the effect of applying this transformation for values
up to k = 10. We can see that, for sufficiently large values of k, the program can
achieve nearly the throughput bounds for all cases. Integer addition achieves a
CPE of 0.54 with k = 7, close to the throughput bound of 0.50 caused by the two
load units. Integer multiplication and floating-point addition achieve CPEs of 1.01
when k ≥ 3, approaching the throughput bound of 1.00 set by their functional units.
Floating-point multiplication achieves a CPE of 0.51 for k ≥ 10, approaching the
throughput bound of 0.50 set by the two floating-point multipliers and the two
load units. It is worth noting that our code is able to achieve nearly twice the
throughput with floating-point multiplication as it can with floating-point addition,
even though multiplication is a more complex operation.

Figure 5.25 CPE performance of k x k loop unrolling. All of the CPEs improve with
this transformation, achieving near or at their throughput bounds.
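To make the k x k generalization concrete, here is a minimal sketch for k = 4, specialized to summing an array of long values. The function name sum4x4 and its plain-array interface are our own assumptions for illustration; they are not part of the combining framework used in the text.

```c
/* Sum n longs using 4 x 4 unrolling: the loop body handles four
   elements per iteration, each feeding its own accumulator, so the
   four additions in one iteration do not depend on each other. */
long sum4x4(const long *data, long n)
{
    long i;
    long limit = n - 3;   /* last index at which a full group of 4 fits */
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += 4) {
        acc0 += data[i];
        acc1 += data[i+1];
        acc2 += data[i+2];
        acc3 += data[i+3];
    }

    /* Finish any remaining elements */
    for (; i < n; i++)
        acc0 += data[i];

    /* Merge the partial sums */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The same pattern extends to larger k by adding accumulators and adjusting limit to n - (k - 1).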

In general, a program can achieve the throughput bound for an operation
only when it can keep the pipelines filled for all of the functional units capable of
performing that operation. For an operation with latency L and capacity C, this
requires an unrolling factor k ≥ C · L. For example, floating-point multiplication
has C = 2 and L = 5, necessitating an unrolling factor of k ≥ 10. Floating-point
addition has C = 1 and L = 3, achieving maximum throughput with k ≥ 3.
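The rule k ≥ C · L is simple enough to express as a helper. The sketch below is only an illustration; the function name min_unroll_factor is hypothetical, and the capacity and latency values are the ones quoted in the text for the reference machine.

```c
#include <stdio.h>

/* Smallest unrolling factor that can keep every pipeline stage of every
   capable functional unit busy: k >= C * L, where C is the number of
   functional units (capacity) and L is the latency in cycles. */
long min_unroll_factor(long capacity, long latency)
{
    return capacity * latency;
}

/* Print the factors for the two floating-point cases discussed above. */
void print_unroll_factors(void)
{
    printf("FP multiply: k >= %ld\n", min_unroll_factor(2, 5));
    printf("FP add:      k >= %ld\n", min_unroll_factor(1, 3));
}
```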

In performing the k x k unrolling transformation, we must consider whether it
preserves the functionality of the original function. We have seen in Chapter 2 that
two's-complement arithmetic is commutative and associative, even when overflow
occurs. Hence, for an integer data type, the result computed by combine6 will be
identical to that computed by combine5 under all possible conditions. Thus, an
optimizing compiler could potentially convert the code shown in combine4 first
to a two-way unrolled variant of combine5 by loop unrolling, and then to that
of combine6 by introducing parallelism. Some compilers do either this or similar
transformations to improve performance for integer data.

On the other hand, floating-point multiplication and addition are not
associative. Thus, combine5 and combine6 could produce different results due to
rounding or overflow. Imagine, for example, a product computation in which all
of the elements with even indices are numbers with very large absolute values,
while those with odd indices are very close to 0.0. In such a case, product PE_n
might overflow, or PO_n might underflow, even though computing product P_n
proceeds normally. In most real-life applications, however, such patterns are unlikely.
Since most physical phenomena are continuous, numerical data tend to be reasonably
smooth and well behaved. Even when there are discontinuities, they do not
generally cause periodic patterns that lead to a condition such as that sketched
earlier. It is unlikely that multiplying the elements in strict order gives fundamentally
better accuracy than does multiplying two groups independently and then
multiplying those products together. For most applications, achieving a performance
gain of 2x outweighs the risk of generating different results for strange data
patterns. Nevertheless, a program developer should check with potential users to see
if there are particular conditions that may cause the revised algorithm to be
unacceptable. Most compilers do not attempt such transformations with floating-point
code, since they have no way to judge the risks of introducing transformations that
can change the program behavior, no matter how small.
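The pathological pattern described above can be reproduced with a few lines of C. This is a sketch under the assumption of IEEE double-precision arithmetic evaluated without extended precision (as on x86-64); the function names prod_in_order and prod_even_odd are hypothetical stand-ins for the combine5 and combine6 styles.

```c
/* In-order product, in the style of combine5 */
double prod_in_order(const double *a, long n)
{
    double acc = 1.0;
    for (long i = 0; i < n; i++)
        acc = acc * a[i];
    return acc;
}

/* Separate even-index and odd-index products, in the style of combine6 */
double prod_even_odd(const double *a, long n)
{
    double acc0 = 1.0;   /* product of even-indexed elements (PE_n) */
    double acc1 = 1.0;   /* product of odd-indexed elements (PO_n) */
    long i;

    for (i = 0; i < n - 1; i += 2) {
        acc0 = acc0 * a[i];
        acc1 = acc1 * a[i+1];
    }
    for (; i < n; i++)
        acc0 = acc0 * a[i];
    return acc0 * acc1;
}
```

For the input {1e200, 1e-200, 1e200, 1e-200}, the in-order product stays close to 1.0 at every step. In the even/odd version, the product of the even-indexed elements overflows to infinity and the product of the odd-indexed elements underflows to zero, so the final multiplication of infinity by zero yields a NaN.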


5.9.2 Reassociation Transformation 


We now explore another way to break the sequential dependencies and thereby
improve performance beyond the latency bound. We saw that the k x 1 loop
unrolling of combine5 did not change the set of operations performed in combining
the vector elements to form their sum or product. By a very small change in the
code, however, we can fundamentally change the way the combining is performed,
and also greatly increase the program performance.

Figure 5.26 shows a function combine7 that differs from the unrolled code of
combine5 (Figure 5.16) only in the way the elements are combined in the inner
loop. In combine5, the combining is performed by the statement

12        acc = (acc OP data[i]) OP data[i+1];

while in combine7 it is performed by the statement

12        acc = acc OP (data[i] OP data[i+1]);

differing only in how two parentheses are placed. We call this a reassociation
transformation, because the parentheses shift the order in which the vector elements
are combined with the accumulated value acc, yielding a form of loop unrolling
we refer to as "2 x 1a."

To an untrained eye, the two statements may seem essentially the same, but
when we measure the CPE, we get a surprising result:


                                             Integer        Floating point
Function   Page  Method                     +      *       +      *
combine4   515   Accumulate in temporary    1.27   3.01    3.01   5.01
combine5   532   2 x 1 unrolling            1.01   3.01    3.01   5.01
combine6   537   2 x 2 unrolling            0.81   1.51    1.51   2.51
combine7   542   2 x 1a unrolling           1.01   1.51    1.51   2.51
Latency bound                               1.00   3.00    3.00   5.00
Throughput bound                            0.50   1.00    1.00   0.50


1   /* 2 x 1a loop unrolling */
2   void combine7(vec_ptr v, data_t *dest)
3   {
4       long i;
5       long length = vec_length(v);
6       long limit = length-1;
7       data_t *data = get_vec_start(v);
8       data_t acc = IDENT;
9
10      /* Combine 2 elements at a time */
11      for (i = 0; i < limit; i+=2) {
12          acc = acc OP (data[i] OP data[i+1]);
13      }
14
15      /* Finish any remaining elements */
16      for (; i < length; i++) {
17          acc = acc OP data[i];
18      }
19      *dest = acc;
20  }


Figure 5.26 Applying 2 x 1a unrolling. By reassociating the arithmetic, this approach 
increases the number of operations that can be performed in parallel. 










The integer addition case matches the performance of k x 1 unrolling
(combine5), while the other three cases match the performance of the versions
with parallel accumulators (combine6), doubling the performance relative to k x 1
unrolling. These cases have broken through the barrier imposed by the latency
bound.

Figure 5.27 illustrates how the code for the inner loop of combine7 (for the
case of multiplication as the combining operation and double as data type) gets
decoded into operations and the resulting data dependencies. We see that the load
operations resulting from the vmovsd and the first vmulsd instructions load vector
elements i and i + 1 from memory, and the first mul operation multiplies them
together. The second mul operation then multiplies this result by the accumulated
value acc. Figure 5.28(a) shows how we rearrange, refine, and abstract the
operations of Figure 5.27 to get a template representing the data dependencies for
one iteration (Figure 5.28(b)). As with the templates for combine5 and combine6,
we have two load and two mul operations, but only one of the mul operations
forms a data-dependency chain between loop registers. When we then replicate
this template n/2 times to show the computations performed in multiplying n
vector elements (Figure 5.29), we see that we only have n/2 operations along the
critical path. The first multiplication within each iteration can be performed
without waiting for the accumulated value from the previous iteration. Thus, we reduce
the minimum possible CPE by a factor of around 2.


  vmovsd (%rax,%rdx,8), %xmm0
  vmulsd 8(%rax,%rdx,8), %xmm0, %xmm0
  vmulsd %xmm0, %xmm1, %xmm1
  addq $2,%rdx
  cmpq %rdx,%rbp
  jg loop

Figure 5.27 Graphical representation of inner-loop code for combine7. Each
iteration gets decoded into similar operations as for combine5 or combine6, but with
different data dependencies.


Figure 5.28 Abstracting combine7 operations as a data-flow graph. We rearrange,
simplify, and abstract the representation of Figure 5.27 to show the data dependencies
between successive iterations. The upper mul operation multiplies two vector elements
with each other, while the lower one multiplies the result by loop variable acc.


Figure 5.29 Data-flow representation of combine7 operating on a vector of length n.
We have a single critical path, but it contains only n/2 operations.
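The reassociation can be tried outside the combining framework with a small, self-contained sketch. Here both variants are specialized to multiplying an array of long values; the names prod_2x1 and prod_2x1a are our own, and the example assumes inputs small enough that no overflow occurs.

```c
/* 2 x 1 unrolling: both multiplications in an iteration depend on acc */
long prod_2x1(const long *data, long n)
{
    long i;
    long acc = 1;

    for (i = 0; i < n - 1; i += 2)
        acc = (acc * data[i]) * data[i+1];
    for (; i < n; i++)
        acc = acc * data[i];
    return acc;
}

/* 2 x 1a unrolling: data[i] * data[i+1] does not depend on acc, so it
   can be computed before the previous iteration's acc is available */
long prod_2x1a(const long *data, long n)
{
    long i;
    long acc = 1;

    for (i = 0; i < n - 1; i += 2)
        acc = acc * (data[i] * data[i+1]);
    for (; i < n; i++)
        acc = acc * data[i];
    return acc;
}
```

Both functions compute the same value for integer data; only the shape of the dependency chain through acc differs. Whether a compiler preserves the parenthesization as written may depend on the compiler version and optimization settings.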


Figure 5.30 demonstrates the effect of applying the reassociation transformation
to achieve what we refer to as k x 1a loop unrolling for values up to k = 10.
We can see that this transformation yields performance results similar to what is
achieved by maintaining k separate accumulators with k x k unrolling. In all cases,
we come close to the throughput bounds imposed by the functional units.

In performing the reassociation transformation, we once again change the
order in which the vector elements will be combined together. For integer addition
and multiplication, the fact that these operations are associative implies that
this reordering will have no effect on the result. For the floating-point cases, we
must once again assess whether this reassociation is likely to significantly affect
the outcome. We would argue that the difference would be immaterial for most
applications.

Figure 5.30 CPE performance for k x 1a loop unrolling. All of the CPEs improve with
this transformation, nearly approaching their throughput bounds.

In summary, a reassociation transformation can reduce the number of operations
along the critical path in a computation, resulting in better performance by
better utilizing the multiple functional units and their pipelining capabilities. Most
compilers will not attempt any reassociations of floating-point operations, since
these operations are not guaranteed to be associative. Current versions of gcc do
perform reassociations of integer operations, but not always with good effects. In
general, we have found that unrolling a loop and accumulating multiple values in
parallel is a more reliable way to achieve improved program performance.














Consider the following function for computing the product of an array of n
double-precision numbers. We have unrolled the loop by a factor of 3.

double aprod(double a[], long n)
{
    long i;
    double x, y, z;
    double r = 1;
    for (i = 0; i < n-2; i+= 3) {
        x = a[i]; y = a[i+1]; z = a[i+2];
        r = r * x * y * z; /* Product computation */
    }
    for (; i < n; i++)
        r *= a[i];
    return r;
}

For the line labeled "Product computation," we can use parentheses to
create five different associations of the computation, as follows:

    r = ((r * x) * y) * z; /* A1 */
    r = (r * (x * y)) * z; /* A2 */
    r = r * ((x * y) * z); /* A3 */
    r = r * (x * (y * z)); /* A4 */
    r = (r * x) * (y * z); /* A5 */

Assume we run these functions on a machine where floating-point multiplication
has a latency of 5 clock cycles. Determine the lower bound on the CPE set by
the data dependencies of the multiplication. (Hint: It helps to draw a data-flow
representation of how r is computed on every iteration.)


Web Aside OPT:SIMD Achieving greater parallelism with vector instructions


As described in Section 3.1, Intel introduced the SSE instructions in 1999, where SSE is the acronym for
"streaming SIMD extensions" and, in turn, SIMD (pronounced "sim-dee") is the acronym for "single
instruction, multiple data." The SSE capability has gone through multiple generations, with more
recent versions being named advanced vector extensions, or AVX. The SIMD execution model involves
operating on entire vectors of data within single instructions. These vectors are held in a special set of
vector registers, named %ymm0-%ymm15. Current AVX vector registers are 32 bytes long, and therefore
each can hold eight 32-bit numbers or four 64-bit numbers, where the numbers can be either integer
or floating-point values. AVX instructions can then perform vector operations on these registers, such
as adding or multiplying eight or four sets of values in parallel. For example, if YMM register %ymm0
contains eight single-precision floating-point numbers, which we denote a_0, ..., a_7, and %rcx contains
the memory address of a sequence of eight single-precision floating-point numbers, which we denote
b_0, ..., b_7, then the instruction

  vmulps (%rcx), %ymm0, %ymm1

will read the eight values from memory and perform eight multiplications in parallel, computing
a_i ← a_i · b_i for 0 ≤ i ≤ 7 and storing the resulting eight products in vector register %ymm1. We see
that a single instruction is able to generate a computation over multiple data values, hence the term
"SIMD."

gcc supports extensions to the C language that let programmers express a program in terms of
vector operations that can be compiled into the vector instructions of AVX (as well as code based
on the earlier SSE instructions). This coding style is preferable to writing code directly in assembly
language, since gcc can also generate code for the vector instructions found on other processors.
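As a rough illustration of this coding style, the following sketch uses the gcc vector_size attribute to declare a 32-byte vector of eight floats and multiplies two such vectors elementwise. The type name vec8f and the function mul8 are hypothetical names chosen for this example, not code from the text. Whether the compiler emits a single vmulps instruction depends on the target options (for example, compiling with -mavx); without them, gcc may split the operation into narrower instructions.

```c
#include <string.h>

/* gcc vector extension: a vector of eight floats, 32 bytes in total */
typedef float vec8f __attribute__((vector_size(32)));

/* Multiply eight floats from a by eight floats from b, elementwise,
   and write the eight products to out. */
void mul8(const float *a, const float *b, float *out)
{
    vec8f va, vb, vr;

    memcpy(&va, a, sizeof(va));    /* load without alignment assumptions */
    memcpy(&vb, b, sizeof(vb));
    vr = va * vb;                  /* one vector expression, eight products */
    memcpy(out, &vr, sizeof(vr));
}
```

Using memcpy for the loads and stores keeps the sketch free of alignment requirements on the caller's arrays; compilers typically turn these copies into plain vector moves.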

Using a combination of gcc instructions, loop unrolling, and multiple accumulators, we are able to
achieve the following performance for our combining functions:

                                  Integer                     Floating point
                             int          long           float          double
Method                     +      *      +      *       +      *       +      *
Scalar 10 x 10            0.54   1.01   0.55   1.00    1.01   0.51    1.01   0.52
Scalar throughput bound   0.50   1.00   0.50   1.00    1.00   0.50    1.00   0.50
Vector 8 x 8              0.05   0.24   0.13   1.51    0.12   0.08    0.25   0.16
Vector throughput bound   0.06   0.12   0.12   -       0.12   0.06    0.25   0.12

In this chart, the first set of numbers is for conventional, scalar code written in the style of combine6,
unrolling by a factor of 10 and maintaining 10 accumulators. The second set of numbers is for code
written in a form that gcc can compile into AVX vector code. In addition to using vector operations,
this version unrolls the main loop by a factor of 8 and maintains eight separate vector accumulators. We
show results for both 32-bit and 64-bit numbers, since the vector instructions achieve 8-way parallelism
in the first case, but only 4-way parallelism in the second.

We can see that the vector code achieves almost an eightfold improvement on the four 32-bit cases,
and a fourfold improvement on three of the four 64-bit cases. Only the long integer multiplication code
does not perform well when we attempt to express it in vector code. The AVX instruction set does not
include one to do parallel multiplication of 64-bit integers, and so gcc cannot generate vector code
for this case. Using vector instructions creates a new throughput bound for the combining operations.
These are eight times lower for 32-bit operations and four times lower for 64-bit operations than the
scalar limits. Our code comes close to achieving these bounds for several combinations of data type
and operation.


5.10 Summary of Results for Optimizing Combining Code


Our efforts at maximizing the performance of a routine that adds or multiplies the 
elements of a vector have clearly paid off. The following summarizes the results 
we obtain with scalar code, not making use of the vector parallelism provided by 
AVX vector instructions: 


                                             Integer        Floating point
Function   Page  Method                     +      *       +      *
combine1   507   Abstract -O1              10.12  10.12   10.17  11.14
combine6   537   2 x 2 unrolling            0.81   1.51    1.51   2.51
                 10 x 10 unrolling          0.55   1.00    1.01   0.52
Latency bound                               1.00   3.00    3.00   5.00
Throughput bound                            0.50   1.00    1.00   0.50

By using multiple optimizations, we have been able to achieve CPEs close to
the throughput bounds of 0.50 and 1.00, limited only by the capacities of the
functional units. These represent 10-20x improvements on the original code. This has
all been done using ordinary C code and a standard compiler. Rewriting the code
to take advantage of the newer SIMD instructions yields additional performance
gains of nearly 4x or 8x. For example, for single-precision multiplication, the CPE
drops from the original value of 11.14 down to 0.06, an overall performance gain
of over 180x. This example demonstrates that modern processors have
considerable amounts of computing power, but we may need to coax this power out of
them by writing our programs in very stylized ways.




5.11 Some Limiting Factors 




We have seen that the critical path in a data-flow graph representation of a
program indicates a fundamental lower bound on the time required to execute a
program. That is, if there is some chain of data dependencies in a program where
the sum of all of the latencies along that chain equals T, then the program will
require at least T cycles to execute.

We have also seen that the throughput bounds of the functional units also
impose a lower bound on the execution time for a program. That is, assume
that a program requires a total of N computations of some operation, that the
microprocessor has C functional units capable of performing that operation, and
that these units have an issue time of I. Then the program will require at least
N · I/C cycles to execute.
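The two lower bounds can be written down directly. The sketch below is only an arithmetic illustration; the function names critical_path_cycles and throughput_bound_cycles are hypothetical, and rounding the throughput bound up to a whole cycle is our own choice.

```c
/* Latency bound: sum of the latencies along one chain of dependent
   operations. lat[i] is the latency, in cycles, of the i-th operation
   on the chain. */
long critical_path_cycles(const long *lat, long len)
{
    long total = 0;
    for (long i = 0; i < len; i++)
        total += lat[i];
    return total;
}

/* Throughput bound: N operations spread over C functional units, each
   able to start a new operation every I cycles, need at least N * I / C
   cycles (rounded up here to a whole number of cycles). */
long throughput_bound_cycles(long n_ops, long issue_time, long capacity)
{
    return (n_ops * issue_time + capacity - 1) / capacity;
}
```

For example, a chain of four floating-point multiplications with latency 5 gives a latency bound of 20 cycles, while 1000 independent multiplications on two fully pipelined multipliers (issue time 1) give a throughput bound of 500 cycles.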

In this section, we will consider some other factors that limit the performance 
of programs on actual machines. 


5.11.1 Register Spilling 


The benefits of loop parallelism are limited by the ability to express the
computation in assembly code. If a program has a degree of parallelism P that exceeds
the number of available registers, then the compiler will resort to spilling,
storing some of the temporary values in memory, typically by allocating space on the
run-time stack. As an example, the following measurements compare the result
of extending the multiple accumulator scheme of combine6 to the cases of k = 10
and k = 20:


                                             Integer
Function   Page  Method                     +      *
combine6   537
                 10 x 10 unrolling          0.55   1.00
                 20 x 20 unrolling          0.83   1.03
Throughput bound                            0.50   1.00

We can see that none of the CPEs improve with this increased unrolling, and
some even get worse. Modern x86-64 processors have 16 integer registers and can
make use of the 16 YMM registers to store floating-point data. Once the number
of loop variables exceeds the number of available registers, the program must
allocate some on the stack.

As an example, the following snippet of code shows how accumulator acc0 is
updated in the inner loop of the code with 10 x 10 unrolling:

Updating of accumulator acc0 in 10 x 10 unrolling
  vmulsd (%rdx), %xmm0, %xmm0        acc0 *= data[i]

We can see that the accumulator is kept in register %xmm0, and so the program can
simply read data[i] from memory and multiply it by this register.

The comparable part of the code for 20 x 20 unrolling has a much different
form:

Updating of accumulator acc0 in 20 x 20 unrolling
  vmovsd 40(%rsp), %xmm0
  vmulsd (%rdx), %xmm0, %xmm0
  vmovsd %xmm0, 40(%rsp)

The accumulator is kept as a local variable on the stack, at offset 40 from the
stack pointer. The program must read both its value and the value of data[i]
from memory, multiply them, and store the result back to memory.
Once a compiler must resort to register spilling, any advantage of maintaining
multiple accumulators will most likely be lost. Fortunately, x86-64 has enough
registers that most loops will become throughput limited before this occurs.


5.11.2 Branch Prediction and Misprediction Penalties 


We demonstrated via experiments in Section 3.6.6 that a conditional branch can
incur a significant misprediction penalty when the branch prediction logic does
not correctly anticipate whether or not a branch will be taken. Now that we have
learned something about how processors operate, we can understand where this
penalty arises.

Modern processors work well ahead of the currently executing instructions,
reading new instructions from memory and decoding them to determine what
operations to perform on what operands. This instruction pipelining works well as
long as the instructions follow in a simple sequence. When a branch is encountered,
the processor must guess which way the branch will go. For the case of a conditional
jump, this means predicting whether or not the branch will be taken. For an
instruction such as an indirect jump (as we saw in the code to jump to an address
specified by a jump table entry) or a procedure return, this means predicting the
target address. In this discussion, we focus on conditional branches.

In a processor that employs speculative execution, the processor begins
executing the instructions at the predicted branch target. It does this in a way that
avoids modifying any actual register or memory locations until the actual
outcome has been determined. If the prediction is correct, the processor can then
"commit" the results of the speculatively executed instructions by storing them in
registers or memory. If the prediction is incorrect, the processor must discard all
of the speculatively executed results and restart the instruction fetch process at
the correct location. The misprediction penalty is incurred in doing this, because
the instruction pipeline must be refilled before useful results are generated.

We saw in Section 3.6.6 that recent versions of x86 processors, including all 
processors capable of executing x86-64 programs, have conditional move instruc- 
tions. Gcc can generate code that uses these instructions when compiling condi- 
tional statements and expressions, rather than the more traditional realizations 
based on conditional transfers of control. The basic idea for translating into con- 
ditional moves is to compute the values along both branches of a conditional 
expression or statement and then use conditional moves to select the desired value. 
We saw in Section 4.5.7 that conditional move instructions can be implemented 
as part of the pipelined processing of ordinary instructions. There is no need to 
guess whether or not the condition will hold, and hence no penalty for guessing 
incorrectly. 
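The difference between the two coding styles can be sketched as follows. Both functions rearrange two arrays so that a[i] holds the smaller and b[i] the larger value of each pair; the names minmax_branch and minmax_select are our own. Whether a compiler actually uses conditional moves for the second version is an assumption that depends on the compiler, its version, and the optimization level, so the generated assembly should be inspected rather than taken for granted.

```c
/* Branching style: the swap happens only on one path of a conditional,
   which invites a conditional transfer of control. */
void minmax_branch(long a[], long b[], long n)
{
    for (long i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

/* Select style: compute both candidate values on every iteration, then
   choose between them, which is a natural fit for conditional moves. */
void minmax_select(long a[], long b[], long n)
{
    for (long i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}
```

The two functions produce the same arrays; the second simply does unconditional stores of values selected by the comparison.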

How, then, can a C programmer make sure that branch misprediction penal- 
ties do not hamper a program’s efficiency? Given the 19-cycle misprediction 
penalty we measured for the reference machine, the stakes are very high. There 
is no simple answer to this question, but the following general principles apply. 


Do Not Be Overly Concerned about Predictable Branches 


We have seen that the cost of a mispredicted branch can be very high, but that
does not mean that all program branches will slow a program down. In fact, the
branch prediction logic found in modern processors is very good at discerning 
regular patterns and long-term trends for the different branch instructions. For 
example, the loop-closing branches in our combining routines would typically be 
predicted as being taken, and hence would only incur a misprediction penalty on 
the last time around. 

As another example, consider the results we observed when shifting from 
combine2 to combine3, when we took the function get_vec_element out of the 
inner loop of the function, as is reproduced below: 


                                          Integer        Floating point
Function   Page   Method                  +       *      +       *
combine2   509    Move vec_length         7.02    9.03   9.02    11.03
combine3   513    Direct data access      7.17    9.02   9.02    11.03





The CPE did not improve, even though the transformation eliminated two
conditionals on each iteration that check whether the vector index is within bounds. For
this function, the checks always succeed, and hence they are highly predictable.

As a way to measure the performance impact of bounds checking, consider
the following combining code, where we have modified the inner loop of combine4
by replacing the access to the data element with the result of performing an
inline substitution of the code for get_vec_element. We will call this new version
combine4b. This code performs bounds checking and also references the vector
elements through the vector data structure.


/* Include bounds check in loop */
void combine4b(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t acc = IDENT;

    for (i = 0; i < length; i++) {
        if (i >= 0 && i < v->len) {
            acc = acc OP v->data[i];
        }
    }
    *dest = acc;
}


We can then directly compare the CPE for the functions with and without bounds 
checking: 


                                          Integer        Floating point
Function    Page   Method                 +       *      +       *
combine4    515    No bounds checking     1.27    3.01   3.01    5.01
combine4b   515    Bounds checking        2.02    3.01   3.01    5.01


The version with bounds checking is slightly slower for the case of integer addition,
but it achieves the same performance for the other three cases. The performance
of these cases is limited by the latencies of their respective combining operations.
The additional computation required to perform bounds checking can take place 
in parallel with the combining operations. The processor is able to predict the 
outcomes of these branches, and so none of this evaluation has much effect on 
the fetching and processing of the instructions that form the critical path in the 
program execution. 


Write Code Suitable for Implementation with Conditional Moves 


Branch prediction is only reliable for regular patterns. Many tests in a program
are completely unpredictable, dependent on arbitrary features of the data, such
as whether a number is negative or positive. For these, the branch prediction logic
will do very poorly. For inherently unpredictable cases, program performance can
be greatly enhanced if the compiler is able to generate code using conditional
data transfers rather than conditional control transfers. This cannot be controlled
directly by the C programmer, but some ways of expressing conditional behavior
can be more directly translated into conditional moves than others.

We have found that Gcc is able to generate conditional moves for code written
in a more "functional" style, where we use conditional operations to compute
values and then update the program state with these values, as opposed to a more
"imperative" style, where we use conditionals to selectively update program state.

There are no strict rules for these two styles, and so we illustrate with an
example. Suppose we are given two arrays of integers a and b, and at each position
i, we want to set a[i] to the minimum of a[i] and b[i], and b[i] to the maximum.


An imperative style of implementing this function is to check at each position 
i and swap the two elements if they are out of order: 


/* Rearrange two vectors so that for each i, b[i] >= a[i] */
void minmax1(long a[], long b[], long n) {
    long i;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}


Our measurements for this function show a CPE of around 13.5 for random data
and 2.5-3.5 for predictable data, an indication of a misprediction penalty of around
20 cycles.

A functional style of implementing this function is to compute the minimum
and maximum values at each position i and then assign these values to a[i] and
b[i], respectively:


/* Rearrange two vectors so that for each i, b[i] >= a[i] */
void minmax2(long a[], long b[], long n) {
    long i;
    for (i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}


Our measurements for this function show a CPE of around 4.0 regardless of 
whether the data are arbitrary or predictable. (We also examined the generated 
assembly code to make sure that it indeed uses conditional moves.) 

As discussed in Section 3.6.6, not all conditional behavior can be implemented
with conditional data transfers, and so there are inevitably cases where
programmers cannot avoid writing code that will lead to conditional branches for which
the processor will do poorly with its branch prediction. But, as we have shown, a
little cleverness on the part of the programmer can sometimes make code more
amenable to translation into conditional data transfers. This requires some amount
of experimentation, writing different versions of the function and then examining
the generated assembly code and measuring performance.


The traditional implementation of the merge step of mergesort requires three
loops [98]:


1   void merge(long src1[], long src2[], long dest[], long n) {
2       long i1 = 0;
3       long i2 = 0;
4       long id = 0;
5       while (i1 < n && i2 < n) {
6           if (src1[i1] < src2[i2])
7               dest[id++] = src1[i1++];
8           else
9               dest[id++] = src2[i2++];
10      }
11      while (i1 < n)
12          dest[id++] = src1[i1++];
13      while (i2 < n)
14          dest[id++] = src2[i2++];
15  }


The branches caused by comparing variables i1 and i2 to n have good prediction
performance: the only mispredictions occur when they first become false. The
comparison between values src1[i1] and src2[i2] (line 6), on the other hand,
is highly unpredictable for typical data. This comparison controls a conditional
branch, yielding a CPE (where the number of elements is 2n) of around 15.0 when
run on random data.

Rewrite the code so that the effect of the conditional statement in the first 
loop (lines 6-9) can be implemented with a conditional move. 


5.12 Understanding Memory Performance 


All of the code we have written thus far, and all the tests we have run, access 
relatively small amounts of memory. For example, the combining routines were 
measured over vectors of length less than 1,000 elements, requiring no more than 
8,000 bytes of data. All modern processors contain one or more cache memories 
to provide fast access to such small amounts of memory. In this section, we will 
further investigate the performance of programs that involve load (reading from 
memory into registers) and store (writing from registers to memory) operations,
considering only the cases where all data are held in cache. In Chapter 6, we go 
into much more detail about how caches work, their performance characteristics, 
and how to write code that makes best use of caches. 

As Figure 5.11 shows, modern processors have dedicated functional units to
perform load and store operations, and these units have internal buffers to hold 
sets of outstanding requests for memory operations. For example, our reference 
machine has two load units, each of which can hold up to 72 pending read requests. 
It has a single store unit with a store buffer containing up to 42 write requests. Each 
of these units can initiate 1 operation every clock cycle. 


5.12.1 Load Performance 


The performance of a program containing load operations depends on both the 
pipelining capability and the latency of the load unit. In our experiments with 
combining operations using our reference machine, we saw that the CPE never 
got below 0.50 for any combination of data type and combining operation, except 
when using SIMD operations. One factor limiting the CPE for our examples is 
that they all require reading one value from memory for each element computed. 
With two load units, each able to initiate at most 1 load operation every clock 
cycle, the CPE cannot be less than 0.50. For applications where we must load k 
values for every element computed, we can never achieve a CPE lower than k/2 
(see, for example, Problem 5.15). 

In our examples so far, we have not seen any performance effects due to the 
latency of load operations. The addresses for our load operations depended only 
on the loop index i, and so the load operations did not form part of a performance- 
limiting critical path. 

To determine the latency of the load operation on a machine, we can set up
a computation with a sequence of load operations, where the outcome of one
determines the address for the next. As an example, consider the function
list_len in Figure 5.31, which computes the length of a linked list. In the loop of this
function, each successive value of variable ls depends on the value read by the
pointer reference ls->next. Our measurements show that function list_len has


typedef struct ELE {
    struct ELE *next;
    long data;
} list_ele, *list_ptr;

long list_len(list_ptr ls) {
    long len = 0;
    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}


Figure 5.31 Linked list function. Its performance is limited by the latency of the load
operation. 
a CPE of 4.00, which we claim is a direct indication of the latency of the load
operation. To see this, consider the assembly code for the loop: 


    Inner loop of list_len
    ls in %rdi, len in %rax
1   .L3:                          loop:
2     addq    $1, %rax              Increment len
3     movq    (%rdi), %rdi          ls = ls->next
4     testq   %rdi, %rdi            Test ls
5     jne     .L3                   If nonnull, goto loop


The movq instruction on line 3 forms the critical bottleneck in this loop. Each
successive value of register %rdi depends on the result of a load operation having
the value in %rdi as its address. Thus, the load operation for one iteration cannot
begin until the one for the previous iteration has completed. The CPE of 4.00 
for this function is determined by the latency of the load operation. Indeed, this 
measurement matches the documented access time of 4 cycles for the reference 
machine's L1 cache, as is discussed in Section 6.4. 
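
To repeat this kind of measurement, a list must first be constructed. The sketch below repeats the declarations of Figure 5.31 and adds two helpers, make_list and free_list, whose names and behavior are assumptions made for illustration and do not come from the text. Note that nodes obtained from separate malloc calls need not be adjacent in memory, so for lists too large to stay in the L1 cache the measured cycles per element could exceed the load latency discussed above.

```c
#include <stdlib.h>

typedef struct ELE {
    struct ELE *next;
    long data;
} list_ele, *list_ptr;

/* Length of a linked list, as in Figure 5.31.  Each iteration must
   wait for the load of ls->next from the previous iteration. */
long list_len(list_ptr ls) {
    long len = 0;
    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}

/* Hypothetical helper: release every node of a list. */
void free_list(list_ptr ls) {
    while (ls) {
        list_ptr next = ls->next;
        free(ls);
        ls = next;
    }
}

/* Hypothetical helper: build a list of n nodes by pushing at the head.
   Returns NULL when n <= 0 or when an allocation fails. */
list_ptr make_list(long n) {
    list_ptr head = NULL;
    for (long i = 0; i < n; i++) {
        list_ptr e = malloc(sizeof(list_ele));
        if (!e) {
            free_list(head);
            return NULL;
        }
        e->data = i;
        e->next = head;
        head = e;
    }
    return head;
}
```

A timing harness would call list_len repeatedly on a list built this way and divide the measured cycles by the list length to estimate the CPE.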


5.12.2 Store Performance 


In all of our examples thus far, we analyzed only functions that reference mem- 
ory mostly with load operations, reading from a memory location into a register. 
Its counterpart, the store operation, writes a register value to memory. The per- 
formance of this operation, particularly in relation to its interactions with load 
operations, involves several subtle issues. 

As with the load operation, in most cases, the store operation can operate in a 
fully pipelined mode, beginning a new store on every cycle. For example, consider 
the function shown in Figure 5.32 that sets the elements of an array dest of length 
n to zero. Our measurements show a CPE of 1.0. This is the best we can achieve 
on a machine with a single store functional unit. 

Unlike the other operations we have considered so far, the store operation 
does not affect any register values. Thus, by their very nature, a series of store 
operations cannot create a data dependency. Only a load operation is affected by 
the result of a store operation, since only a load can read back the memory value 
that has been written by the store. The function write_read shown in Figure 5.33


/* Set elements of array to 0 */
void clear_array(long *dest, long n) {
    long i;
    for (i = 0; i < n; i++)
        dest[i] = 0;
}


Figure 5.32 Function to set array elements to 0. This code achieves a CPE of 1.0. 
/* Write to dest, read from src */
void write_read(long *src, long *dst, long n)
{
    long cnt = n;
    long val = 0;

    while (cnt) {
        *dst = val;
        val = (*src)+1;
        cnt--;
    }
}





Figure 5.33. Code to write and read memory locations, along with illustrative 
executions. This function highlights the interactions between stores and loads when 
arguments src and dest are equal. 


illustrates the potential interactions between loads and stores. This figure also
shows two example executions of this function, when it is called for a two-element
array a, with initial contents -10 and 17, and with argument cnt equal to 3. These
executions illustrate some subtleties of the load and store operations.

In Example A of Figure 5.33, argument src is a pointer to array element
a[0], while dest is a pointer to array element a[1]. In this case, each load by the
pointer reference *src will yield the value -10. Hence, after two iterations,
the array elements will remain fixed at -10 and -9, respectively. The result
of the read from src is not affected by the write to dest. Measuring this example
over a larger number of iterations gives a CPE of 1.3.

In Example B of Figure 5.33, both arguments src and dest are pointers to
array element a[0]. In this case, each load by the pointer reference *src will
yield the value stored by the previous execution of the pointer reference *dest.
Figure 5.34 Detail of load and store units. The store unit maintains a buffer of
pending writes. The load unit must check its address with those in the store
unit to detect a write/read dependency.


As a consequence, a series of ascending values will be stored in this location. In
general, if function write_read is called with arguments src and dest pointing
to the same memory location, and with argument cnt having some value n > 0, the
net effect is to set the location to n - 1. This example illustrates a phenomenon we
will call a write/read dependency: the outcome of a memory read depends on a
recent memory write. Our performance measurements show that Example B has
a CPE of 7.3. The write/read dependency causes a slowdown in the processing of
around 6 clock cycles.
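
The two executions can be reproduced with a short C sketch. The function write_read is repeated from Figure 5.33 so that the block is self-contained; the wrappers example_a and example_b are hypothetical names introduced here only to package the two calling patterns.

```c
/* Write to dst, read from src (as in Figure 5.33). */
void write_read(long *src, long *dst, long n) {
    long cnt = n;
    long val = 0;

    while (cnt) {
        *dst = val;
        val = (*src) + 1;
        cnt--;
    }
}

/* Example A (hypothetical wrapper): src points to a[0], dst to a[1].
   Every load reads the unchanged a[0], so a[1] settles at a[0] + 1.
   Returns the final value of a[1]. */
long example_a(long n) {
    long a[2] = {-10, 17};
    write_read(&a[0], &a[1], n);
    return a[1];
}

/* Example B (hypothetical wrapper): src and dst both point to a[0].
   Each load reads the value just stored, so a[0] counts upward and
   ends at n - 1 for n > 0.  Returns the final value of a[0]. */
long example_b(long n) {
    long a[2] = {-10, 17};
    write_read(&a[0], &a[0], n);
    return a[0];
}
```

With n equal to 3, example_a returns -9 and example_b returns 2, matching the array contents described for the two examples.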

To see how the processor can distinguish between these two cases and why 
one runs slower than the other, we must take a more detailed look at the load and 
store execution units, as shown in Figure 5.34. The store unit includes a store buffer 
containing the addresses and data of the store operations that have been issued 
to the store unit, but have not yet been completed, where completion involves 
updating the data cache. This buffer is provided so that a series of store operations 
can be executed without having to wait for each one to update the cache. When 
a load operation occurs, it must check the entries in the store buffer for matching
addresses. If it finds a match (meaning that any of the bytes being written have the 
same address as any of the bytes being read), it retrieves the corresponding data 
entry as the result of the load operation. 

Gcc generates the following code for the inner loop of write_read:


  Inner loop of write_read
  src in %rdi, dst in %rsi, val in %rax
.L3:                            loop:
  movq    %rax, (%rsi)            Write val to dst
  movq    (%rdi), %rax            t = *src
  addq    $1, %rax                val = t+1
  subq    $1, %rdx                cnt--
  jne     .L3                     If != 0, goto loop














Figure 5.35 Graphical representation of inner-loop code for write_read. The
first movq instruction is decoded into separate operations to compute the
store address and to store the data to memory.





Figure 5.35 shows a data-flow representation of this loop code. The instruction
movq %rax,(%rsi) is translated into two operations: The s_addr instruction
computes the address for the store operation, creates an entry in the store buffer, and
sets the address field for that entry. The s_data operation sets the data field for the
entry. As we will see, the fact that these two computations are performed
independently can be important to program performance. This motivates the separate
functional units for these operations in the reference machine.

In addition to the data dependencies between the operations caused by the
writing and reading of registers, the arcs on the right of the operators denote
a set of implicit dependencies for these operations. In particular, the address
computation of the s_addr operation must clearly precede the s_data operation. In
addition, the load operation generated by decoding the instruction movq (%rdi),
%rax must check the addresses of any pending store operations, creating a data
dependency between it and the s_addr operation. The figure shows a dashed arc
between the s_data and load operations. This dependency is conditional: if the
two addresses match, the load operation must wait until the s_data has deposited
its result into the store buffer, but if the two addresses differ, the two operations
can proceed independently.

Figure 5.36 illustrates the data dependencies between the operations for the 
inner loop of write_read. In Figure 5.36(a), we have rearranged the operations
to allow the dependencies to be seen more clearly. We have labeled the three 
dependencies involving the load and store operations for special attention. The arc 
labeled “1” represents the requirement that the store address must be computed 
before the data can be stored. The arc labeled "2" represents the need for the 
load operation to compare its address with that for any pending store operations. 
Finally, the dashed arc labeled “3” represents the conditional data dependency 
that arises when the load and store addresses match. 


Figure 5.36(b) illustrates what happens when we take away those operations
that do not directly affect the flow of data from one iteration to the next. The
data-flow graph shows just two chains of dependencies: the one on the left, with
data values being stored, loaded, and incremented (only for the case of matching
addresses); and the one on the right, decrementing variable cnt.

Figure 5.36 Abstracting the operations for write_read. We first rearrange the
operators of Figure 5.35 (a) and then show only those operations that use values
from one iteration to produce new values for the next (b).


We can now understand the performance characteristics of function
write_read. Figure 5.37 illustrates the data dependencies formed by multiple iterations of
its inner loop. For the case of Example A in Figure 5.33, with differing source and
destination addresses, the load and store operations can proceed independently,
and hence the only critical path is formed by the decrementing of variable cnt,
resulting in a CPE bound of 1.0. For the case of Example B with matching source
and destination addresses, the data dependency between the s_data and load
instructions causes a critical path to form involving data being stored, loaded, and
incremented. We found that these three operations in sequence require a total of
around 7 clock cycles.

As these two examples show, the implementation of memory operations
involves many subtleties. With operations on registers, the processor can determine
which instructions will affect which others as they are being decoded into opera- 
tions. With memory operations, on the other hand, the processor cannot predict 
which will affect which others until the load and store addresses have been com- 
puted. Efficient handling of memory operations is critical to the performance of 
many programs. The memory subsystem makes use of many optimizations, such 
as the potential parallelism when operations can proceed independently. 




As another example of code with potential load-store interactions, consider the 
following function to copy the contents of one array to another: 




void copy_array(long *src, long *dest, long n)
{
    long i;
    for (i = 0; i < n; i++)
        dest[i] = src[i];
}

Figure 5.37 Data-flow representation of function write_read. When the two
addresses do not match, the only critical path is formed by the decrementing of
cnt (Example A). When they do match, the chain of data being stored, loaded,
and incremented forms the critical path (Example B).


Suppose a is an array of length 1,000 initialized so that each element a[i]
equals i. 


A. What would be the effect of the call copy_array(a+1,a,999)?
B. What would be the effect of the call copy_array(a,a+1,999)?


C. Our performance measurements indicate that the call of part A has a CPE
   of 1.2 (which drops to 1.0 when the loop is unrolled by a factor of 4), while
   the call of part B has a CPE of 5.0. To what factor do you attribute this
   performance difference?


D. What performance would you expect for the call copy_array(a,a,999)?

We saw that our measurements of the prefix-sum function psum1 (Figure 5.1) yield
a CPE of 9.00 on a machine where the basic operation to be performed,
floating-point addition, has a latency of just 3 clock cycles. Let us try to understand why
our function performs so poorly.

The following is the assembly code for the inner loop of the function: 


  Inner loop of psum1
  a in %rdi, i in %rax, cnt in %rdx
.L5:                                     loop:
  vmovss  -4(%rsi,%rax,4), %xmm0           Get p[i-1]
  vaddss  (%rdi,%rax,4), %xmm0, %xmm0      Add a[i]
  vmovss  %xmm0, (%rsi,%rax,4)             Store at p[i]
  addq    $1, %rax                         Increment i
  cmpq    %rdx, %rax                       Compare i:cnt
  jne     .L5                              If !=, goto loop


Perform an analysis similar to those shown for combine3 (Figure 5.14) and for
write_read (Figure 5.36) to diagram the data dependencies created by this loop,
and hence, the critical path that forms as the computation proceeds. Explain why
the CPE is so high.





Rewrite the code for psum1 (Figure 5.1) so that it does not need to repeatedly
retrieve the value of p[i] from memory. You do not need to use loop unrolling.
We measured the resulting code to have a CPE of 3.00, limited by the latency of
floating-point addition.


5.13 Life in the Real World: Performance Improvement 
Techniques 


Although we have only considered a limited set of applications, we can draw
important lessons on how to write efficient code. We have described a number
of basic strategies for optimizing program performance:


High-level design. Choose appropriate algorithms and data structures for the
problem at hand. Be especially vigilant to avoid algorithms or coding
techniques that yield asymptotically poor performance.


Basic coding principles. Avoid optimization blockers so that a compiler can
generate efficient code.
• Eliminate excessive function calls. Move computations out of loops
  when possible. Consider selective compromises of program modularity
  to gain greater efficiency.
• Eliminate unnecessary memory references. Introduce temporary
  variables to hold intermediate results. Store a result in an array or global
  variable only when the final value has been computed.


Low-level optimizations. Structure code to take advantage of the hardware
capabilities.
• Unroll loops to reduce overhead and to enable further optimizations.
• Find ways to increase instruction-level parallelism by techniques such
  as multiple accumulators and reassociation.
• Rewrite conditional operations in a functional style to enable
  compilation via conditional data transfers.


A final word of advice to the reader is to be vigilant to avoid introducing 
errors as you rewrite programs in the interest of efficiency. It is very easy to make 
mistakes when introducing new variables, changing loop bounds, and making the 
code more complex overall. One useful technique is to use checking code to test
each version of a function as it is being optimized, to ensure no bugs are introduced 
during this process. Checking code applies a series of tests to the new versions of 
a function and makes sure they yield the same results as the original. The set of 
test cases must become more extensive with highly optimized code, since there 
are more cases to consider. For example, checking code that uses loop unrolling 
requires testing for many different loop bounds to make sure it handles all of the 
different possible numbers of single-step iterations required at the end. 
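
As a minimal sketch of such checking code, the C fragment below compares a simple summation loop against a version unrolled by a factor of 4, for every length from 0 up to a chosen maximum. The functions sum_ref, sum_unroll4, and check_sum are hypothetical examples written for this illustration; they are not the combining routines measured earlier in the chapter.

```c
#include <stdlib.h>

/* Reference version: one element per iteration. */
long sum_ref(const long *a, long n) {
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Version under test: 4x loop unrolling plus a cleanup loop that
   handles the 0 to 3 elements left over at the end. */
long sum_unroll4(const long *a, long n) {
    long i;
    long acc = 0;
    long limit = n - 3;

    for (i = 0; i < limit; i += 4)
        acc = acc + a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)
        acc += a[i];
    return acc;
}

/* Checking code: try every length 0..max_n so that all possible
   numbers of leftover iterations are exercised.  Returns the number
   of lengths on which the two versions disagree, or -1 if the test
   data could not be allocated. */
long check_sum(long max_n) {
    long *a = malloc((size_t)(max_n + 1) * sizeof(long));
    long mismatches = 0;

    if (!a)
        return -1;
    for (long i = 0; i <= max_n; i++)
        a[i] = 3 * i - 7;            /* arbitrary mixed-sign test data */
    for (long n = 0; n <= max_n; n++)
        if (sum_ref(a, n) != sum_unroll4(a, n))
            mismatches++;
    free(a);
    return mismatches;
}
```

Sweeping over every length, rather than testing a single large one, is what catches an off-by-one error in the loop bound or in the cleanup loop.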


5.14 Identifying and Eliminating Performance Bottlenecks 


Up to this point, we have only considered optimizing small programs, where there 
is some clear place in the program that limits its performance and therefore should 
be the focus of our optimization efforts. When working with large programs, even 
knowing where to focus our optimization efforts can be difficult. In this section, 
we describe how to use code profilers, analysis tools that collect performance 
data about a program as it executes. We also discuss some general principles 
of code optimization, including the implications of Amdahl's law, introduced in 
Section 1.9.1. 


5.14.1 Program Profiling 


Program profiling involves running a version of a program in which
instrumentation code has been incorporated to determine how much time the different parts
of the program require. It can be very useful for identifying the parts of a program
we should focus on in our optimization efforts. One strength of profiling is that it
can be performed while running the actual program on realistic benchmark data.

Unix systems provide the profiling program GPROF. This program generates
two forms of information. First, it determines how much CPU time was spent
for each of the functions in the program. Second, it computes a count of how
many times each function gets called, categorized by which function performs the
call. Both forms of information can be quite useful. The timings give a sense of
the relative importance of the different functions in determining the overall run
time. The calling information allows us to understand the dynamic behavior of the
program.

Profiling with GPROF requires three steps, as shown for a C program prog.c,
which runs with command-line argument file.txt: 





1. The program must be compiled and linked for profiling. With Gcc (and other
   C compilers), this involves simply including the run-time flag -pg on the
   command line. It is important to ensure that the compiler does not attempt to
   perform any optimizations via inline substitution, or else the calls to functions
   may not be tabulated accurately. We use optimization flag -Og, guaranteeing
   that function calls will be tracked properly.

       linux> gcc -Og -pg prog.c -o prog

2. The program is then executed as usual:

       linux> ./prog file.txt

   It runs slightly (around a factor of 2) slower than normal, but otherwise the
   only difference is that it generates a file gmon.out.

3. GPROF is invoked to analyze the data in gmon.out:

       linux> gprof prog


The first part of the profile report lists the times spent executing the different 
functions, sorted in descending order. As an example, the following listing shows 
this part of the report for the three most time-consuming functions in a program: 





  %    cumulative    self                 self     total
 time    seconds    seconds      calls   s/call   s/call  name
97.58     203.66     203.66          1   203.66   203.66  sort_words
 2.32     208.50       4.85     965027     0.00     0.00  find_ele_rec
 0.14     208.81       0.30   12511031     0.00     0.00  Strlen





Each row represents the time spent for all calls to some function. The first 
column indicates the percentage of the overall time spent on the function. The 
second shows the cumulative time spent by the functions up to and including 
the one on this row. The third shows the time spent on this particular function, 
and the fourth shows how many times it was called (not counting recursive calls). 
In our example, the function sort_words was called only once, but this single
call required 203.66 seconds, while the function find_ele_rec was called 965,027
times (not including recursive calls), requiring a total of 4.85 seconds. Function
Strlen computes the length of a string by calling the library function strlen.
Library function calls are normally not shown in the results by GPROF. Their times
are usually reported as part of the function calling them. By creating the "wrapper
function" Strlen, we can reliably track the calls to strlen, showing that it was
called 12,511,031 times but only requiring a total of 0.30 seconds.

The second part of the profile report shows the calling history of the functions.
The following is the history for a recursive function find_ele_rec:


                               158655725             find_ele_rec [5]
              4.85    0.10     965027/965027         insert_string [4]
[5]    2.4    4.85    0.10     965027+158655725  find_ele_rec [5]
              0.08    0.01     363039/363039         save_string [8]
              0.00    0.01     363039/363039         new_ele [12]
                               158655725             find_ele_rec [5]


This history shows both the functions that called find_ele_rec, as well as 
the functions that it called. The first two lines show the calls to the function: 
158,655,725 calls by itself recursively, and 965,027 calls by function insert_string 
(which is itself called 965,027 times). Function find_ele_rec, in turn, called two 
other functions, save_string and new_ele, each a total of 363,039 times.

From these call data, we can often infer useful information about the program 
behavior. For example, the function find_ele_rec is a recursive procedure that
scans the linked list for a hash bucket looking for a particular string. For this
function, comparing the number of recursive calls with the number of top-level
calls provides statistical information about the lengths of the traversals through
these lists. Given that their ratio is 164.4:1, we can infer that the program scanned 
an average of around 164 elements each time. 
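
The arithmetic behind this estimate is a single division of the two call counts in the profile. The helper below is a hypothetical illustration (the name recursive_per_top_level is not from the text) that makes the calculation explicit.

```c
/* Ratio of recursive calls to top-level calls for a recursive
   function, as read off a profile's call history.  Returns 0.0
   when there are no top-level calls. */
double recursive_per_top_level(long recursive_calls, long top_level_calls) {
    if (top_level_calls <= 0)
        return 0.0;
    return (double) recursive_calls / (double) top_level_calls;
}
```

Called with the counts reported for find_ele_rec, 158,655,725 recursive calls and 965,027 top-level calls, the function yields a ratio of roughly 164.4, which is the figure quoted above.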

Some properties of GPROF are worth noting: 


• The timing is not very precise. It is based on a simple interval counting scheme
  in which the compiled program maintains a counter for each function recording
  the time spent executing that function. The operating system causes the
  program to be interrupted at some regular time interval δ. Typical values of
  δ range between 1.0 and 10.0 milliseconds. It then determines what function
  the program was executing when the interrupt occurred and increments the
  counter for that function by δ. Of course, it may happen that this function just
  started executing and will shortly be completed, but it is assigned the full cost
  of the execution since the previous interrupt. Some other function may run
  between two interrupts and therefore not be charged any time at all.

  Over a long duration, this scheme works reasonably well. Statistically, every
  function should be charged according to the relative time spent executing
  it. For programs that run for less than around 1 second, however, the numbers
  should be viewed as only rough estimates.

• The calling information is quite reliable, assuming no inline substitutions
  have been performed. The compiled program maintains a counter for each
  combination of caller and callee. The appropriate counter is incremented
  every time a procedure is called.

• By default, the timings for library functions are not shown. Instead, these
  times are incorporated into the times for the calling functions.
5.14.2 Using a Profiler to Guide Optimization


As an example of using a profiler to guide program optimization, we created an
application that involves several different tasks and data structures. This application
analyzes the n-gram statistics of a text document, where an n-gram is a sequence
of n words occurring in a document. For n = 1, we collect statistics on individual
words, for n = 2 on pairs of words, and so on. For a given value of n, our program
reads a text file, creates a table of unique n-grams and how many times each one
occurs, then sorts the n-grams in descending order of occurrence.

As a benchmark, we ran it on a file consisting of the complete works of William
Shakespeare, totaling 965,028 words, of which 23,706 are unique. We found that
for n = 1, even a poorly written analysis program can readily process the entire file
in under 1 second, and so we set n = 2 to make things more challenging. For the
case of n = 2, n-grams are referred to as bigrams (pronounced “bye-grams”). We
determined that Shakespeare's works contain 363,039 unique bigrams. The most
common is “I am,” occurring 1,892 times. Perhaps his most famous bigram, “to
be,” occurs 1,020 times. Fully 266,018 of the bigrams occur only once.

Our program consists of the following parts. We created multiple versions, 
starting with simple algorithms for the different parts and then replacing them 
with more sophisticated ones: 


1. Each word is read from the file and converted to lowercase. Our initial version
   used the function lower1 (Figure 5.7), which we know to have quadratic run
   time due to repeated calls to strlen.

2. A hash function is applied to the string to create a number between 0 and
   s − 1, for a hash table with s buckets. Our initial function simply summed the
   ASCII codes for the characters modulo s.

3. Each hash bucket is organized as a linked list. The program scans down this
   list looking for a matching entry. If one is found, the frequency for this n-gram
   is incremented. Otherwise, a new list element is created. Our initial version
   performed this operation recursively, inserting new elements at the end of the
   list.

4. Once the table has been generated, we sort all of the elements according to
   the frequencies. Our initial version used insertion sort.


Figure 5.38 shows the profile results for six different versions of our n-gram- 
frequency analysis program. For each version, we divide the time into the follow- 
ing categories: 

Sort. Sorting n-grams by frequency

List. Scanning the linked list for a matching n-gram, inserting a new element if
necessary

Lower. Converting strings to lowercase

Strlen. Computing string lengths

Hash. Computing the hash function

Rest. The sum of all other functions


Figure 5.38 Profile results for different versions of bigram-frequency counting program. Time is divided
according to the different major operations in the program.


As part (a) of the figure shows, our initial version required 3.5 minutes, with most
of the time spent sorting. This is not surprising, since insertion sort has quadratic
run time and the program sorted 363,039 values.

In our next version, we performed sorting using the library function qsort,
which is based on the quicksort algorithm [98]. It has an expected run time of
O(n log n). This version is labeled “Quicksort” in the figure. The more efficient
sorting algorithm reduces the time spent sorting to become negligible, and the
overall run time to around 5.4 seconds. Part (b) of the figure shows the times for
the remaining versions on a scale where we can see them more clearly.
With improved sorting, we now find that list scanning becomes the bottleneck.
Thinking that the inefficiency is due to the recursive structure of the function,
we replaced it by an iterative one, shown as “Iter first.” Surprisingly, the run
time increases to around 7.5 seconds. On closer study, we find a subtle difference
between the two list functions. The recursive version inserted new elements at the
end of the list, while the iterative one inserted them at the front. To maximize
performance, we want the most frequent n-grams to occur near the beginning of
the lists. That way, the function will quickly locate the common cases. Assuming
that n-grams are spread uniformly throughout the document, we would expect
the first occurrence of a frequent one to come before that of a less frequent
one. By inserting new n-grams at the end, the first function tended to order
n-grams in descending order of frequency, while the second function tended to do
just the opposite. We therefore created a third list-scanning function that uses
iteration but inserts new elements at the end of the list. With this version, shown
as “Iter last,” the time dropped to around 5.3 seconds, slightly better than with the
recursive version. These measurements demonstrate the importance of running
experiments on a program as part of an optimization effort. We initially assumed
that converting recursive code to iterative code would improve its performance
and did not consider the distinction between adding to the end or to the beginning
of a list.

Next, we consider the hash table structure. The initial version had only 1,021
buckets (typically, the number of buckets is chosen to be a prime number to
enhance the ability of the hash function to distribute keys uniformly among the
buckets). For a table with 363,039 entries, this would imply an average load of
363,039/1,021 = 355.6. That explains why so much of the time is spent performing
list operations: the searches involve testing a significant number of candidate
n-grams. It also explains why the performance is so sensitive to the list ordering.
We then increased the number of buckets to 199,999, reducing the average load
to 1.8. Oddly enough, however, our overall run time only drops to 5.1 seconds, a
difference of only 0.2 seconds.

On further inspection, we can see that the minimal performance gain with
a larger table was due to a poor choice of hash function. Simply summing the
character codes for a string does not produce a very wide range of values. In
particular, the maximum code value for a letter is 122, and so a string of n
characters will generate a sum of at most 122n. The longest bigram in our document,
“honorificabilitudinitatibus thou,” sums to just 3,371, and so most of the
buckets in our hash table will go unused. In addition, a commutative hash function,
such as addition, does not differentiate among the different possible orderings of
characters within a string. For example, the words “rat” and “tar” will generate the
same sums.

We switched to a hash function that uses shift and EXCLUSIVE-OR operations. 
With this version, shown as “Better hash,” the time drops to 0.6 seconds. A more 
systematic approach would be to study the distribution of keys among the buckets 
more carefully, making sure that it comes close to what one would expect if the 
hash function had a uniform output distribution. 
Finally, we have reduced the run time to the point where most of the time is
spent in strlen, and most of the calls to strlen occur as part of the lowercase
conversion. We have already seen that function lower1 has quadratic performance,
especially for long strings. The words in this document are short enough to avoid
the disastrous consequences of quadratic performance; the longest bigram is just
32 characters. Still, switching to lower2, shown as “Linear lower,” yields a
significant improvement, with the overall time dropping to around 0.2 seconds.

With this exercise, we have shown that code profiling can help drop the
time required for a simple application from 3.5 minutes down to 0.2 seconds,
yielding a performance gain of around 1,000x. The profiler helps us focus our
attention on the most time-consuming parts of the program and also provides
useful information about the procedure call structure. Some of the bottlenecks
in our code, such as using a quadratic sort routine, are easy to anticipate, while
others, such as whether to append to the beginning or end of a list, emerge only
through a careful analysis.

We can see that profiling is a useful tool to have in the toolbox, but it should
not be the only one. The timing measurements are imperfect, especially for shorter
(less than 1 second) run times. More significantly, the results apply only to the
particular data tested. For example, if we had run the original function on data
consisting of a smaller number of longer strings, we would have found that the
lowercase conversion routine was the major performance bottleneck. Even worse,
if we only profiled documents with short words, we might never detect hidden
bottlenecks such as the quadratic performance of lower1. In general, profiling can
help us optimize for typical cases, assuming we run the program on representative
data, but we should also make sure the program will have respectable performance
for all possible cases. This mainly involves avoiding algorithms (such as insertion
sort) and bad programming practices (such as lower1) that yield poor asymptotic
performance.

Amdahl's law, described in Section 1.9.1, provides some additional insights
into the performance gains that can be obtained by targeted optimizations. For our
n-gram code, we saw the total execution time drop from 209.0 to 5.4 seconds when
we replaced insertion sort by quicksort. The initial version spent 203.7 of its 209.0
seconds performing insertion sort, giving α = 0.974, the fraction of time subject
to speedup. With quicksort, the time spent sorting becomes negligible, giving a
predicted speedup of 1/(1 − α) ≈ 39.0, close to the measured speedup of 38.5. We
were able to gain a large speedup because sorting constituted a very large fraction
of the overall execution time. However, when one bottleneck is eliminated, a new
one arises, and so gaining additional speedup required focusing on other parts of
the program.









5.15 Summary 


Although most presentations on code optimization describe how compilers can
generate efficient code, much can be done by an application programmer to assist
the compiler in this task. No compiler can replace an inefficient algorithm or data
structure by a good one, and so these aspects of program design should remain
a primary concern for programmers. We also have seen that optimization blockers,
such as memory aliasing and procedure calls, seriously restrict the ability of
compilers to perform extensive optimizations. Again, the programmer must take
primary responsibility for eliminating these. These should simply be considered
parts of good programming practice, since they serve to eliminate unneeded work.

Tuning performance beyond a basic level requires some understanding of the
processor’s microarchitecture, describing the underlying mechanisms by which
the processor implements its instruction set architecture. For the case of
out-of-order processors, just knowing something about the operations, capabilities,
latencies, and issue times of the functional units establishes a baseline for
predicting program performance.

We have studied a series of techniques (including loop unrolling, creating
multiple accumulators, and reassociation) that can exploit the instruction-level
parallelism provided by modern processors. As we get deeper into the optimization,
it becomes important to study the generated assembly code and to try to
understand how the computation is being performed by the machine. Much can
be gained by identifying the critical paths determined by the data dependencies
in the program, especially between the different iterations of a loop. We can also
compute a throughput bound for a computation, based on the number of operations
that must be computed and the number and issue times of the units that
perform those operations.

Programs that involve conditional branches or complex interactions with
the memory system are more difficult to analyze and optimize than the simple
loop programs we first considered. The basic strategy is to try to make branches
more predictable or make them amenable to implementation using conditional
data transfers. We must also watch out for the interactions between store and
load operations. Keeping values in local variables, allowing them to be stored in
registers, can often be helpful.

When working with large programs, it becomes important to focus our
optimization efforts on the parts that consume the most time. Code profilers and
related tools can help us systematically evaluate and improve program performance.
We described GPROF, a standard Unix profiling tool. More sophisticated
profilers are available, such as the VTUNE program development system from Intel,
and VALGRIND, commonly available on Linux systems. These tools can break
down the execution time below the procedure level to estimate the performance
of each basic block of the program. (A basic block is a sequence of instructions that
has no transfers of control out of its middle, and so the block is always executed
in its entirety.)
takes a similar approach but goes into more detail with respect to the processor's
characteristics. 

Many publications describe code optimization from a compiler’s perspective, 
formulating ways that compilers can generate more efficient code. Muchnick’s 
book is considered the most comprehensive [80]. Wadleigh and Crawford’s book 
on software optimization [115] covers some of the material we have presented, 
but it also describes the process of getting high performance on parallel machines. 
An early paper by Mahlke et al. [75] describes how several techniques developed 
for compilers that map programs onto parallel machines can be adapted to exploit 
the instruction-level parallelism of modern processors. This paper covers the code 
transformations we presented, including loop unrolling, multiple accumulators 
(which they refer to as accumulator variable expansion), and reassociation (which 
they refer to as tree height reduction).

Our presentation of the operation of an out-of-order processor is fairly brief 
and abstract. More complete descriptions of the general principles can be found in
advanced computer architecture textbooks, such as the one by Hennessy and Pat- 
terson [46, Ch. 2-3]. Shen and Lipasti's book [100] provides an in-depth treatment 
of modern processor design. 


Homework Problems 


5.13 ◆◆

Suppose we wish to write a procedure that computes the inner product of two 
vectors u and v. An abstract version of the function has a CPE of 14-18 with x86- 
64 for different types of integer and floating-point data. By doing the same sort 
of transformations we did to transform the abstract program combine1 into the 
more efficient combine4, we get the following code: 


    /* Inner product. Accumulate in temporary */
    void inner4(vec_ptr u, vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(u);
        data_t *udata = get_vec_start(u);
        data_t *vdata = get_vec_start(v);
        data_t sum = (data_t) 0;

        for (i = 0; i < length; i++) {
            sum = sum + udata[i] * vdata[i];
        }
        *dest = sum;
    }


Our measurements show that this function has CPEs of 1.50 for integer data
and 3.00 for floating-point data. For data type double, the x86-64 assembly code
for the inner loop is as follows:
    Inner loop of inner4. data_t = double, OP = *
    udata in %rbp, vdata in %rax, sum in %xmm0
    i in %rcx, limit in %rbx
    .L15:                                    loop:
      vmovsd  0(%rbp,%rcx,8), %xmm1            Get udata[i]
      vmulsd  (%rax,%rcx,8), %xmm1, %xmm1      Multiply by vdata[i]
      vaddsd  %xmm1, %xmm0, %xmm0              Add to sum
      addq    $1, %rcx                         Increment i
      cmpq    %rbx, %rcx                       Compare i:limit
      jne     .L15                             If !=, goto loop


Assume that the functional units have the characteristics listed in Figure 5.12. 


A. Diagram how this instruction sequence would be decoded into operations 
and show how the data dependencies between them would create a critical 
path of operations, in the style of Figures 5.13 and 5.14. 


B. For data type double, what lower bound on the CPE is determined by the 
critical path? 

C. Assuming similar instruction sequences for the integer code as well, what
   lower bound on the CPE is determined by the critical path for integer data?

D. Explain how the floating-point versions can have CPEs of 3.00, even though
   the multiplication operation requires 5 clock cycles.


5.14 ◆

Write a version of the inner product procedure described in Problem 5.13 that
uses 6 × 1 loop unrolling. For x86-64, our measurements of the unrolled version
give a CPE of 1.07 for integer data but still 3.01 for floating-point data.


A. Explain why any (scalar) version of an inner product procedure running on 
an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00. 


B. Explain why the performance for floating-point data did not improve with
   loop unrolling.


5.15 ◆
Write a version of the inner product procedure described in Problem 5.13 that
uses 6 × 6 loop unrolling. Our measurements for this function with x86-64 give a
CPE of 1.06 for integer data and 1.01 for floating-point data.

What factor limits the performance to a CPE of 1.00?


5.16 ◆
Write a version of the inner product procedure described in Problem 5.13 that
uses 6 × 1a loop unrolling to enable greater parallelism. Our measurements for
this function give a CPE of 1.10 for integer data and 1.05 for floating-point data.


5.17 ◆◆
The library function memset has the following prototype:

    void *memset(void *s, int c, size_t n);

This function fills n bytes of the memory area starting at s with copies of the
low-order byte of c. For example, it can be used to zero out a region of memory by
giving argument 0 for c, but other values are possible.

The following is a straightforward implementation of memset:

    /* Basic implementation of memset */
    void *basic_memset(void *s, int c, size_t n)
    {
        size_t cnt = 0;
        unsigned char *schar = s;
        while (cnt < n) {
            *schar++ = (unsigned char) c;
            cnt++;
        }
        return s;
    }


Implement a more efficient version of the function by using a word of data 
type unsigned long to pack eight copies of c, and then step through the region 
using word-level writes. You might find it helpful to do additional loop unrolling 
as well. On our reference machine, we were able to reduce the CPE from 1.00 for 
the straightforward implementation to 0.127. That is, the program is able to write 
8 bytes every clock cycle. 

Here are some additional guidelines. To ensure portability, let K denote the
value of sizeof(unsigned long) for the machine on which you run your program.

• You may not call any library functions.

• Your code should work for arbitrary values of n, including when it is not a
  multiple of K. You can do this in a manner similar to the way we finish the
  last few iterations with loop unrolling.

• You should write your code so that it will compile and run correctly on any
  machine regardless of the value of K. Make use of the operation sizeof to
  do this.

• On some machines, unaligned writes can be much slower than aligned ones.
  (On some non-x86 machines, they can even cause segmentation faults.) Write
  your code so that it starts with byte-level writes until the destination address
  is a multiple of K, then do word-level writes, and then (if necessary) finish
  with byte-level writes.

• Beware of the case where cnt is small enough that the upper bounds on
  some of the loops become negative. With expressions involving the sizeof
  operator, the testing may be performed with unsigned arithmetic. (See
  Section 2.2.8 and Problem 2.72.)


5.18 ◆◆◆
We considered the task of polynomial evaluation in Practice Problems 5.5 and 5.6,
with both a direct evaluation and an evaluation by Horner's method. Try to write
faster versions of the function using the optimization techniques we have explored,
including loop unrolling, parallel accumulation, and reassociation. You will find
many different ways of mixing together Horner's scheme and direct evaluation
with these optimization techniques.

Ideally, you should be able to reach a CPE close to the throughput limit of 
your machine. Our best version achieves a CPE of 1.07 on our reference machine. 


5.19 ◆◆◆

In Problem 5.12, we were able to reduce the CPE for the prefix-sum computation 
to 3.00, limited by the latency of floating-point addition on this machine. Simple 
loop unrolling does not improve things. 

Using a combination of loop unrolling and reassociation, write code for a
prefix sum that achieves a CPE less than the latency of floating-point addition
on your machine. Doing this requires actually increasing the number of additions
performed. For example, our version with two-way unrolling requires three
additions per iteration, while our version with four-way unrolling requires five. Our
best implementation achieves a CPE of 1.67 on our reference machine.

Determine how the throughput and latency limits of your machine limit the 
minimum CPE you can achieve for the prefix-sum operation. 


Solutions to Practice Problems 


Solution to Problem 5.1 (page 500) 
This problem illustrates some of the subtle effects of memory aliasing. 
As the following commented code shows, the effect will be to set the value at
xp to zero:

    *xp = *xp + *xp; /* 2x */
    *xp = *xp - *xp; /* 2x-2x = 0 */
    *xp = *xp - *xp; /* 0-0 = 0 */


This example illustrates that our intuition about program behavior can often 
be wrong. We naturally think of the case where xp and yp are distinct but overlook 
the possibility that they might be equal. Bugs often arise due to conditions the 
programmer does not anticipate. 


Solution to Problem 5.2 (page 504) 

This problem illustrates the relationship between CPE and absolute performance. 
It can be solved using elementary algebra. We find that for n ≤ 2, version 1 is the
fastest. Version 2 is fastest for 3 ≤ n ≤ 7, and version 3 is fastest for n ≥ 8.


Solution to Problem 5.3 (page 512)

This is a simple exercise, but it is important to recognize that the four statements
of a for loop (initial, test, update, and body) get executed different numbers of
times.

[Table not recovered; surviving column headings: max, incr, square]


Solution to Problem 5.4 (page 516) 
This assembly code demonstrates a clever optimization opportunity detected by
GCC. It is worth studying this code carefully to better understand the subtleties of
code optimization.

A. In the less optimized code, register %xmm0 is simply used as a temporary value,
   both set and used on each loop iteration. In the more optimized code, it is
   used more in the manner of variable acc in combine4, accumulating the
   product of the vector elements. The difference with combine4, however,
   is that location dest is updated on each iteration by the second vmovsd
   instruction.

   We can see that this optimized version operates much like the following
   C code:

    /* Make sure dest updated on each iteration */
    void combine3w(vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(v);
        data_t *data = get_vec_start(v);
        data_t acc = IDENT;

        /* Initialize in event length <= 0 */
        *dest = acc;

        for (i = 0; i < length; i++) {
            acc = acc OP data[i];
            *dest = acc;
        }
    }

B. The two versions of combine3 will have identical functionality, even with
   memory aliasing.

C. This transformation can be made without changing the program behavior,
   because, with the exception of the first iteration, the value read from dest at
   the beginning of each iteration will be the same value written to this register
   at the end of the previous iteration. Therefore, the combining instruction
   can simply use the value already in %xmm0 at the beginning of the loop.


Solution to Problem 5.5 (page 530)


Polynomial evaluation is a core technique for solving many problems. For example, 
polynomial functions are commonly used to approximate trigonometric functions 
in math libraries. 


A. The function performs 2n multiplications and n additions. 


B. We can see that the performance-limiting computation here is the repeated
   computation of the expression xpwr = x * xpwr. This requires a floating-point
   multiplication (5 clock cycles), and the computation for one iteration
   cannot begin until the one for the previous iteration has completed. The
   updating of result only requires a floating-point addition (3 clock cycles)
   between successive iterations.


Solution to Problem 5.6 (page 530) 
This problem demonstrates that minimizing the number of operations in a com- 
putation may not improve its performance. 


A. The function performs n multiplications and n additions, half the number of
   multiplications as the original function poly.

B. We can see that the performance-limiting computation here is the repeated
   computation of the expression result = a[i] + x*result. Starting from the
   value of result from the previous iteration, we must first multiply it by x (5
   clock cycles) and then add it to a[i] (3 cycles) before we have the value for
   this iteration. Thus, each iteration imposes a minimum latency of 8 cycles,
   exactly our measured CPE.

C. Although each iteration in function poly requires two multiplications rather
   than one, only a single multiplication occurs along the critical path per
   iteration.


Solution to Problem 5.7 (page 532) 


The following code directly follows the rules we have stated for unrolling a loop
by some factor k:

    void unroll5(vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(v);
        long limit = length-4;
        data_t *data = get_vec_start(v);
        data_t acc = IDENT;

        /* Combine 5 elements at a time */
        for (i = 0; i < limit; i+=5) {
            acc = acc OP data[i] OP data[i+1];
            acc = acc OP data[i+2] OP data[i+3];
            acc = acc OP data[i+4];
        }

        /* Finish any remaining elements */
        for (; i < length; i++) {
            acc = acc OP data[i];
        }
        *dest = acc;
    }


Solution to Problem 5.8 (page 545)

This problem demonstrates how small changes in a program can yield dramatic
performance differences, especially on a machine with out-of-order execution.
Figure 5.39 diagrams the three multiplication operations for a single iteration
of the function. In this figure, the operations shown as blue boxes are along the
critical path; they need to be computed in sequence to compute a new value for
loop variable r. The operations shown as light boxes can be computed in parallel
with the critical path operations. For a loop with P operations along the critical
path, each iteration will require a minimum of 5P clock cycles and will compute
the product for three elements, giving a lower bound on the CPE of 5P/3. This
implies lower bounds of 5.00 for A1, 3.33 for A2 and A5, and 1.67 for A3 and A4.
We ran these functions on an Intel Core i7 Haswell processor and found that it
could achieve these CPE values.

A1: ((r*x)*y)*z    A2: (r*(x*y))*z    A3: r*((x*y)*z)    A4: r*(x*(y*z))    A5: (r*x)*(y*z)

Figure 5.39 Data dependencies among multiplication operations for cases in Problem 5.8. The
operations shown as blue boxes form the critical paths for the iterations.


Solution to Problem 5.9 (page 553)
This is another demonstration that a slight change in coding style can make it much
easier for the compiler to detect opportunities to use conditional moves:

    while (i1 < n && i2 < n) {
        long v1 = src1[i1];
        long v2 = src2[i2];
        long take1 = v1 < v2;
        dest[id++] = take1 ? v1 : v2;
        i1 += take1;
        i2 += (1-take1);
    }

We measured a CPE of around 12.0 for this version of the code, a modest
improvement over the original CPE of 15.0.


Solution to Problem 5.10 (page 559) 
This problem requires you to analyze the potential load-store interactions in a
program.


A. It will set each element a[i] to i + 1, for 0 ≤ i ≤ 998.

B. It will set each element a[i] to 0, for 1 ≤ i ≤ 999.


C. In the second case, the load of one iteration depends on the result of the store
   from the previous iteration. Thus, there is a write/read dependency between
   successive iterations.


D. It will give a CPE of 1.2, the same as for Example A, since there are no
   dependencies between stores and subsequent loads.


Solution to Problem 5.11 (page 561)

We can see that this function has a write/read dependency between successive
iterations: the destination value p[i] on one iteration matches the source value
p[i-1] on the next. A critical path is therefore formed for each iteration consisting
of a store (from the previous iteration), a load, and a floating-point addition.
The CPE measurement of 9.0 is consistent with our measurement of 7.3 for the
CPE of write_read when there is a data dependency, since write_read involves
an integer addition (1 clock-cycle latency), while psum1 involves a floating-point
addition (3 clock-cycle latency).


Solution to Problem 5.12 (page 561) 
Here is a revised version of the function: 


    void psum1a(float a[], float p[], long n)
    {
        long i;
        /* last_val holds p[i-1]; val holds p[i] */
        float last_val, val;
        last_val = p[0] = a[0];
        for (i = 1; i < n; i++) {
            val = last_val + a[i];
            p[i] = val;
            last_val = val;
        }
    }

We introduce a local variable last_val. At the start of iteration i, it holds the
value of p[i-1]. We then compute val to be the value of p[i] and to be the new
value for last_val.

This version compiles to the following assembly code: 


    Inner loop of psum1a
    a in %rdi, i in %rax, cnt in %rdx, last_val in %xmm0
    .L16:                                  loop:
      vaddss  (%rdi,%rax,4), %xmm0, %xmm0    last_val = val = last_val + a[i]
      vmovss  %xmm0, (%rsi,%rax,4)           Store val in p[i]
      addq    $1, %rax                       Increment i
      cmpq    %rdx, %rax                     Compare i:cnt
      jne     .L16                           If !=, goto loop


This code holds last_val in %xmm0, avoiding the need to read p[i-1] from
memory and thus eliminating the write/read dependency seen in psum1.


















The Memory Hierarchy

6.1 Storage Technologies
6.2 Locality
6.3 The Memory Hierarchy
6.4 Cache Memories
6.5 Writing Cache-Friendly Code
6.6 Putting It Together: The Impact of Caches on Program Performance
6.7 Summary
Bibliographic Notes
Homework Problems
Solutions to Practice Problems




To this point in our study of systems, we have relied on a simple model of a
computer system as a CPU that executes instructions and a memory system
that holds instructions and data for the CPU. In our simple model, the memory
system is a linear array of bytes, and the CPU can access each memory location in
a constant amount of time. While this is an effective model up to a point, it does
not reflect the way that modern systems really work.

In practice, a memory system is a hierarchy of storage devices with different
capacities, costs, and access times. CPU registers hold the most frequently used
data. Small, fast cache memories nearby the CPU act as staging areas for a subset
of the data and instructions stored in the relatively slow main memory. The main
memory stages data stored on large, slow disks, which in turn often serve as
staging areas for data stored on the disks or tapes of other machines connected by
networks.

Memory hierarchies work because well-written programs tend to access the
storage at any particular level more frequently than they access the storage at the
next lower level. So the storage at the next level can be slower, and thus larger
and cheaper per bit. The overall effect is a large pool of memory that costs as
much as the cheap storage near the bottom of the hierarchy but that serves data
to programs at the rate of the fast storage near the top of the hierarchy.

As a programmer, you need to understand the memory hierarchy because it
has a big impact on the performance of your applications. If the data your program
needs are stored in a CPU register, then they can be accessed in 0 cycles during
the execution of the instruction. If stored in a cache, 4 to 75 cycles. If stored in
main memory, hundreds of cycles. And if stored on disk, tens of millions of cycles!

Here, then, is a fundamental and enduring idea in computer systems: if you
understand how the system moves data up and down the memory hierarchy, then
you can write your application programs so that their data items are stored higher
in the hierarchy, where the CPU can access them more quickly.

This idea centers around a fundamental property of computer programs
known as locality. Programs with good locality tend to access the same set of
data items over and over again, or they tend to access sets of nearby data items.
Programs with good locality tend to access more data items from the upper levels
of the memory hierarchy than programs with poor locality, and thus run faster.
For example, on our Core i7 system, the running times of different matrix
multiplication kernels that perform the same number of arithmetic operations, but
have different degrees of locality, can vary by a factor of almost 40!
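
To make the idea of locality concrete, here is a minimal sketch in C. The array dimensions and function names are hypothetical choices for illustration and do not come from the text. Both functions perform exactly the same additions, but because C stores two-dimensional arrays in row-major order, the first walks memory with stride 1 (nearby items are accessed one after another), while the second jumps COLS elements between consecutive accesses.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4   /* hypothetical dimensions, chosen only for illustration */
#define COLS 8

/* Good spatial locality: the inner loop visits consecutive addresses. */
long sum_rowwise(int a[ROWS][COLS])
{
    long sum = 0;
    int i, j;
    assert(a != NULL);
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Poorer spatial locality: the inner loop strides by COLS elements. */
long sum_colwise(int a[ROWS][COLS])
{
    long sum = 0;
    int i, j;
    assert(a != NULL);
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

The two functions return the same result; any difference in running time comes only from the order in which memory is touched. For an array this small no difference should be expected, since the whole array is likely to fit in a cache; the sketch only shows the access patterns.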

In this chapter, we will look at the basic storage technologies (SRAM memory,
DRAM memory, ROM memory, and rotating and solid state disks) and
describe how they are organized into hierarchies. In particular, we focus on the
cache memories that act as staging areas between the CPU and main memory,
because they have the most impact on application program performance. We show
you how to analyze your C programs for locality, and we introduce techniques for
improving the locality in your programs. You will also learn an interesting way to
characterize the performance of the memory hierarchy on a particular machine
as a "memory mountain" that shows read access times as a function of locality.

6.1 Storage Technologies


Much of the success of computer technology stems from the tremendous progress 
in storage technology. Early computers had a few kilobytes of random access 
memory. The earliest IBM PCs didn't even have a hard disk. That changed with 
the introduction of the IBM PC-XT in 1982, with its 10-megabyte disk. By the year 
2015, typical machines had 300,000 times as much disk storage, and the amount of 
storage was increasing by a factor of 2 every couple of years. 


6.1.1 Random Access Memory 


Random access memory (RAM) comes in two varieties: static and dynamic. Static
RAM (SRAM) is faster and significantly more expensive than dynamic RAM
(DRAM). SRAM is used for cache memories, both on and off the CPU chip.
DRAM is used for the main memory plus the frame buffer of a graphics system.
Typically, a desktop system will have no more than a few tens of megabytes of
SRAM, but hundreds or thousands of megabytes of DRAM.


Static RAM 


SRAM stores each bit in a bistable memory cell. Each cell is implemented with
a six-transistor circuit. This circuit has the property that it can stay indefinitely
in either of two different voltage configurations, or states. Any other state will
be unstable: starting from there, the circuit will quickly move toward one of the
stable states. Such a memory cell is analogous to the inverted pendulum illustrated
in Figure 6.1.

The pendulum is stable when it is tilted either all the way to the left or all the
way to the right. From any other position, the pendulum will fall to one side or the
other. In principle, the pendulum could also remain balanced in a vertical position
indefinitely, but this state is metastable: the smallest disturbance would make it
start to fall, and once it fell it would never return to the vertical position.

Due to its bistable nature, an SRAM memory cell will retain its value indef- 
initely, as long as it is kept powered. Even when a disturbance, such as electrical 
noise, perturbs the voltages, the circuit will return to the stable value when the 
disturbance is removed. 


Figure 6.1 Inverted pendulum. Like an SRAM cell, the pendulum has only two
stable configurations, or states. (Panels: Stable left, Unstable, Stable right.)


        Transistors  Relative                               Relative
        per bit      access time  Persistent?  Sensitive?   cost      Applications

SRAM    6            1x           Yes          No           1,000x    Cache memory
DRAM    1            10x          No           Yes          1x        Main memory, frame buffers


Figure 6.2 Characteristics of DRAM and SRAM memory.


Dynamic RAM 


DRAM stores each bit as charge on a capacitor. This capacitor is very small,
typically around 30 femtofarads, that is, $30 \times 10^{-15}$ farads. Recall, however, that
a farad is a very large unit of measure. DRAM storage can be made very dense:
each cell consists of a capacitor and a single access transistor. Unlike SRAM,
however, a DRAM memory cell is very sensitive to any disturbance. When the
capacitor voltage is disturbed, it will never recover. Exposure to light rays will
cause the capacitor voltages to change. In fact, the sensors in digital cameras and
camcorders are essentially arrays of DRAM cells.

Various sources of leakage current cause a DRAM cell to lose its charge 
within a time period of around 10 to 100 milliseconds. Fortunately, for computers 
operating with clock cycle times measured in nanoseconds, this retention time is
quite long. The memory system must periodically refresh every bit of memory by 
reading it out and then rewriting it. Some systems also use error-correcting codes, 
where the computer words are encoded using a few more bits (e.g., a 64-bit word 
might be encoded using 72 bits), such that circuitry can detect and correct any 
single erroneous bit within a word. 

Figure 6.2 summarizes the characteristics of SRAM and DRAM memory. 
SRAM is persistent as long as power is applied. Unlike DRAM, no refresh is 
necessary. SRAM can be accessed faster than DRAM. SRAM is not sensitive to 
disturbances such as light and electrical noise. The trade-off is that SRAM cells 
use more transistors than DRAM cells and thus have lower densities, are more 
expensive, and consume more power. 


Conventional DRAMs 


The cells (bits) in a DRAM chip are partitioned into d supercells, each consisting
of w DRAM cells. A d x w DRAM stores a total of dw bits of information. The
supercells are organized as a rectangular array with r rows and c columns, where
rc = d. Each supercell has an address of the form (i, j), where i denotes the row
and j denotes the column.

For example, Figure 6.3 shows the organization of a 16 x 8 DRAM chip with
d = 16 supercells, w = 8 bits per supercell, r = 4 rows, and c = 4 columns. The
shaded box denotes the supercell at address (2, 1). Information flows in and out
of the chip via external connectors called pins. Each pin carries a 1-bit signal.
Figure 6.3 shows two of these sets of pins: eight data pins that can transfer 1 byte
in or out of the chip, and two addr pins that carry two-bit row and column supercell
addresses. Other pins that carry control information are not shown.

Figure 6.3 High-level view of a 128-bit 16 x 8 DRAM chip.

Each DRAM chip is connected to some circuitry, known as the memory 
controller, that can transfer w bits at a time to and from each DRAM chip. To read 
the contents of supercell (i, j), the memory controller sends the row address i to 
the DRAM, followed by the column address j. The DRAM responds by sending 
the contents of supercell (i, j) back to the controller. The row address i is called
a RAS (row access strobe) request. The column address j is called a CAS (column
access strobe) request. Notice that the RAS and CAS requests share the same
DRAM address pins.

For example, to read supercell (2, 1) from the 16 x 8 DRAM in Figure 6.3, the
memory controller sends row address 2, as shown in Figure 6.4(a). The DRAM
responds by copying the entire contents of row 2 into an internal row buffer. Next,
the memory controller sends column address 1, as shown in Figure 6.4(b). The
DRAM responds by copying the 8 bits in supercell (2, 1) from the row buffer and
sending them to the memory controller.

One reason circuit designers organize DRAMs as two-dimensional arrays
instead of linear arrays is to reduce the number of address pins on the chip. For
example, if our example 128-bit DRAM were organized as a linear array of 16
supercells with addresses 0 to 15, then the chip would need four address pins
instead of two. The disadvantage of the two-dimensional array organization is
that addresses must be sent in two distinct steps, which increases the access time.
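
The pin-count argument above can be checked with a short sketch. The function names are hypothetical, and the row-major numbering used to split a linear supercell number into (i, j) is an assumption made for illustration; the text does not say how a chip numbers its supercells.

```c
#include <assert.h>

/* Smallest b such that 2^b >= n: the number of address bits (pins)
 * needed to select one of n items. */
unsigned addr_bits(unsigned long n)
{
    unsigned b = 0;
    assert(n > 0);
    while (b < 8 * sizeof n - 1 && (1UL << b) < n)
        b++;
    return b;
}

/* Address pins for a linear array of d supercells. */
unsigned pins_linear(unsigned long d)
{
    return addr_bits(d);
}

/* Address pins for an r x c array. The row and column addresses are
 * sent in two steps over the same pins, so the larger of the two
 * widths determines the pin count. */
unsigned pins_2d(unsigned long r, unsigned long c)
{
    unsigned br = addr_bits(r);
    unsigned bc = addr_bits(c);
    return br > bc ? br : bc;
}

/* Split linear supercell number k into (row i, column j) for an array
 * with c columns, assuming row-major numbering. */
void supercell_addr(unsigned long k, unsigned long c,
                    unsigned long *i, unsigned long *j)
{
    assert(c > 0);
    *i = k / c;
    *j = k % c;
}
```

For the 16 x 8 example, pins_linear(16) gives four pins and pins_2d(4, 4) gives two, matching the counts in the text.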
Figure 6.4 Reading the contents of a DRAM supercell. (a) Select row 2 (RAS
request). (b) Select column 1 (CAS request).


Memory Modules


DRAM chips are packaged in memory modules that plug into expansion slots on
the main system board (motherboard). Core i7 systems use the 240-pin dual inline
memory module (DIMM), which transfers data to and from the memory controller
in 64-bit chunks.

Figure 6.5 shows the basic idea of a memory module. The example module
stores a total of 64 MB (megabytes) using eight 64-Mbit 8M x 8 DRAM chips,
numbered 0 to 7. Each supercell stores 1 byte of main memory, and each 64-bit
word at byte address A in main memory is represented by the eight supercells
whose corresponding supercell address is (i, j). In the example in Figure 6.5,
DRAM 0 stores the first (lower-order) byte, DRAM 1 stores the next byte, and
so on.

To retrieve the word at memory address A, the memory controller converts
A to a supercell address (i, j) and sends it to the memory module, which then
broadcasts i and j to each DRAM. In response, each DRAM outputs the 8-bit
contents of its (i, j) supercell. Circuitry in the module collects these outputs and
forms them into a 64-bit word, which it returns to the memory controller.

Main memory can be aggregated by connecting multiple memory modules to
the memory controller. In this case, when the controller receives an address A, the
controller selects the module k that contains A, converts A to its (i, j) form, and
sends (i, j) to module k.




In the following, let r be the number of rows in a DRAM array, c the number of
columns, $b_r$ the number of bits needed to address the rows, and $b_c$ the number of
bits needed to address the columns. For each of the following DRAMs, determine
the power-of-2 array dimensions that minimize $\max(b_r, b_c)$, the maximum number
of bits needed to address the rows or columns of the array.

Figure 6.5 Reading the contents of a memory module.
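
The way a memory module forms a 64-bit word from the outputs of its eight DRAM chips, as described earlier in this section, can be sketched as follows. The function and constant names are hypothetical. The byte ordering follows the example in the text, where DRAM 0 supplies the first (lower-order) byte; the sketch models only the final collection step, not the RAS/CAS signaling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCHIPS 8   /* eight byte-wide DRAM chips per module, as in the example */

/* chip_out[k] is the 8-bit supercell that DRAM k outputs for the
 * broadcast address (i, j). DRAM 0 holds the lowest-order byte. */
uint64_t assemble_word(const uint8_t chip_out[NCHIPS])
{
    uint64_t word = 0;
    int k;
    assert(chip_out != NULL);
    for (k = 0; k < NCHIPS; k++)
        word |= (uint64_t)chip_out[k] << (8 * k);
    return word;
}
```

For instance, if DRAMs 0 through 7 output the bytes 0x01 through 0x08, the assembled word is 0x0807060504030201.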
Enhanced DRAMs


There are many kinds of DRAM memories, and new kinds appear on the market
with regularity as manufacturers attempt to keep up with rapidly increasing
processor speeds. Each is based on the conventional DRAM cell, with optimizations
that improve the speed with which the basic DRAM cells can be accessed.


Fast page mode DRAM (FPM DRAM). A conventional DRAM copies an
    entire row of supercells into its internal row buffer, uses one, and then
    discards the rest. FPM DRAM improves on this by allowing consecutive
    accesses to the same row to be served directly from the row buffer. For
    example, to read four supercells from row i of a conventional DRAM, the
    memory controller must send four RAS/CAS requests, even though the
    row address i is identical in each case. To read supercells from the same
    row of an FPM DRAM, the memory controller sends an initial RAS/CAS
    request, followed by three CAS requests. The initial RAS/CAS request
    copies row i into the row buffer and returns the supercell addressed by the
    CAS. The next three supercells are served directly from the row buffer
    and thus are returned more quickly than the initial supercell.


Extended data out DRAM (EDO DRAM). An enhanced form of FPM
    DRAM that allows the individual CAS signals to be spaced closer
    together in time.


Synchronous DRAM (SDRAM). Conventional, FPM, and EDO DRAMs are
    asynchronous in the sense that they communicate with the memory
    controller using a set of explicit control signals. SDRAM replaces many of
    these control signals with the rising edges of the same external clock
    signal that drives the memory controller. Without going into detail, the net
    effect is that an SDRAM can output the contents of its supercells at a
    faster rate than its asynchronous counterparts.


Double Data-Rate Synchronous DRAM (DDR SDRAM). DDR SDRAM is an
    enhancement of SDRAM that doubles the speed of the DRAM by using
    both clock edges as control signals. Different types of DDR SDRAMs
    are characterized by the size of a small prefetch buffer that increases the
    effective bandwidth: DDR (2 bits), DDR2 (4 bits), and DDR3 (8 bits).


Video RAM (VRAM). Used in the frame buffers of graphics systems. VRAM
    is similar in spirit to FPM DRAM. Two major differences are that (1)
    VRAM output is produced by shifting the entire contents of the internal
    buffer in sequence and (2) VRAM allows concurrent reads and writes to
    the memory. Thus, the system can be painting the screen with the pixels
    in the frame buffer (reads) while concurrently writing new values for the
    next update (writes).
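
To illustrate the saving that FPM DRAM offers over a conventional DRAM, the sketch below counts the RAS requests needed for a sequence of supercell reads. It is a simplified model under stated assumptions: the function names are hypothetical, the chip has a single row buffer, and a new RAS is assumed to be needed exactly when the row differs from the previous access. Every read still needs its own CAS in both designs.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional DRAM: every access sends a RAS followed by a CAS,
 * so n accesses need n RAS requests regardless of the rows touched. */
unsigned ras_conventional(const unsigned rows[], unsigned n)
{
    assert(rows != NULL || n == 0);
    return n;
}

/* FPM DRAM (simplified): consecutive accesses to the same row are
 * served from the row buffer, so a RAS is sent only when the row
 * changes (or on the very first access). */
unsigned ras_fpm(const unsigned rows[], unsigned n)
{
    unsigned i, count = 0;
    assert(rows != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (i == 0 || rows[i] != rows[i - 1])
            count++;
    }
    return count;
}
```

For the four same-row reads in the FPM example above, the conventional count is four RAS requests and the FPM count is one, matching the text's "initial RAS/CAS request, followed by three CAS requests."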


Nonvolatile Memory 


DRAMs and SRAMs are volatile in the sense that they lose their information if the
supply voltage is turned off. Nonvolatile memories, on the other hand, retain their
information even when they are powered off. There are a variety of nonvolatile
memories. For historical reasons, they are referred to collectively as read-only
memories (ROMs), even though some types of ROMs can be written to as well as
read. ROMs are distinguished by the number of times they can be reprogrammed
(written to) and by the mechanism for reprogramming them.

Aside  Historical popularity of DRAM technologies

Until 1995, most PCs were built with FPM DRAMs. From 1996 to 1999, EDO DRAMs dominated the
market, while FPM DRAMs all but disappeared. SDRAMs first appeared in 1995 in high-end systems,
and by 2002 most PCs were built with SDRAMs and DDR SDRAMs. By 2010, most server and desktop
systems were built with DDR SDRAMs. In fact, the Intel Core i7 supports only DDR3 SDRAM.

A programmable ROM (PROM) can be programmed exactly once. PROMs
include a sort of fuse with each memory cell that can be blown once by zapping it
with a high current.

An erasable programmable ROM (EPROM) has a transparent quartz window
that permits light to reach the storage cells. The EPROM cells are cleared to zeros
by shining ultraviolet light through the window. Programming an EPROM is done
by using a special device to write ones into the EPROM. An EPROM can be
erased and reprogrammed on the order of 1,000 times. An electrically erasable
PROM (EEPROM) is akin to an EPROM, but it does not require a physically
separate programming device, and thus can be reprogrammed in-place on printed
circuit cards. An EEPROM can be reprogrammed on the order of $10^5$ times before
it wears out.

Flash memory is a type of nonvolatile memory, based on EEPROMs, that 
has become an important storage technology. Flash memories are everywhere, 
providing fast and durable nonvolatile storage for a slew of electronic devices, 
including digital cameras, cell phones, and music players, as well as laptop, desktop, 
and server computer systems. In Section 6.1.3, we will look in detail at a new form 
of flash-based disk drive, known as a solid state disk (SSD), that provides a faster, 
sturdier, and less power-hungry alternative to conventional rotating disks. 

Programs stored in ROM devices are often referred to as firmware. When a 
computer system is powered up, it runs firmware stored in a ROM. Some systems 
provide a small set of primitive input and output functions in firmware—for 
example, a PC's BIOS (basic input/output system) routines. Complicated devices 
such as graphics cards and disk drive controllers also rely on firmware to translate 
I/O (input/output) requests from the CPU. 


Accessing Main Memory 


Data flows back and forth between the processor and the DRAM main memory 
over shared electrical conduits called buses. Each transfer of data between the 
CPU and memory is accomplished with a series of steps called a bus transaction. 
A read transaction transfers data from the main memory to the CPU. A write 
transaction transfers data from the CPU to the main memory. 

A bus is a collection of parallel wires that carry address, data, and control
signals. Depending on the particular bus design, data and address signals can share 
the same set of wires or can use different sets. Also, more than two devices can 
share the same bus. The control wires carry signals that synchronize the transaction 
and identify what kind of transaction is currently being performed. For example, 
is this transaction of interest to the main memory, or to some other I/O device 
such as a disk controller? Is the transaction a read or a write? Is the information 
on the bus an address or a data item? 

Figure 6.6 shows the configuration of an example computer system. The main
components are the CPU chip, a chipset that we will call an I/O bridge (which
includes the memory controller), and the DRAM memory modules that make up
main memory. These components are connected by a pair of buses: a system bus
that connects the CPU to the I/O bridge, and a memory bus that connects the I/O
bridge to the main memory. The I/O bridge translates the electrical signals of the
system bus into the electrical signals of the memory bus. As we will see, the I/O
bridge also connects the system bus and memory bus to an I/O bus that is shared
by I/O devices such as disks and graphics cards. For now, though, we will focus on
the memory bus.

Figure 6.6 Example bus structure that connects the CPU and main memory.

Aside  A note on bus designs

Bus design is a complex and rapidly changing aspect of computer systems. Different vendors develop
different bus architectures as a way to differentiate their products. For example, some Intel systems use
chipsets known as the northbridge and the southbridge to connect the CPU to memory and I/O devices,
respectively. In older Pentium and Core 2 systems, a front side bus (FSB) connects the CPU to the
northbridge. Systems from AMD replace the FSB with the HyperTransport interconnect, while newer
Intel Core i7 systems use the QuickPath interconnect. The details of these different bus architectures
are beyond the scope of this text. Instead, we will use the high-level bus architecture from Figure 6.6
as a running example throughout. It is a simple but useful abstraction that allows us to be concrete. It
captures the main ideas without being tied too closely to the detail of any proprietary designs.

Consider what happens when the CPU performs a load operation such as

movq A,%rax

where the contents of address A are loaded into register %rax. Circuitry on the
CPU chip called the bus interface initiates a read transaction on the bus. The
read transaction consists of three steps. First, the CPU places the address A
on the system bus. The I/O bridge passes the signal along to the memory bus
(Figure 6.7(a)). Next, the main memory senses the address signal on the memory
bus, reads the address from the memory bus, fetches the data from the DRAM,
and writes the data to the memory bus. The I/O bridge translates the memory bus
signal into a system bus signal and passes it along to the system bus (Figure 6.7(b)).
Finally, the CPU senses the data on the system bus, reads the data from the bus,
and copies the data to register %rax (Figure 6.7(c)).

Figure 6.7 Memory read transaction for a load operation: movq A,%rax.
(a) CPU places address A on the memory bus.
(b) Main memory reads A from the bus, retrieves word x, and places it on the bus.
(c) CPU reads word x from the bus, and copies it into register %rax.

Conversely, when the CPU performs a store operation such as

movq %rax,A

where the contents of register %rax are written to address A, the CPU initiates
a write transaction. Again, there are three basic steps. First, the CPU places the
address on the system bus. The memory reads the address from the memory bus
and waits for the data to arrive (Figure 6.8(a)). Next, the CPU copies the data in
%rax to the system bus (Figure 6.8(b)). Finally, the main memory reads the data
from the memory bus and stores the bits in the DRAM (Figure 6.8(c)).

Figure 6.8 Memory write transaction for a store operation: movq %rax,A.
(a) CPU places address A on the memory bus. Main memory reads it and waits for the data word.
(b) CPU places data word y on the bus.
(c) Main memory reads data word y from the bus and stores it at address A.


6.1.2 Disk Storage


Disks are workhorse storage devices that hold enormous amounts of data, on
the order of hundreds to thousands of gigabytes, as opposed to the hundreds or
thousands of megabytes in a RAM-based memory. However, it takes on the order
of milliseconds to read information from a disk, a hundred thousand times longer
than from DRAM and a million times longer than from SRAM.


Disk Geometry 


Disks are constructed from platters. Each platter consists of two sides, or surfaces,
that are coated with magnetic recording material. A rotating spindle in the center
of the platter spins the platter at a fixed rotational rate, typically between 5,400 and
15,000 revolutions per minute (RPM). A disk will typically contain one or more of
these platters encased in a sealed container.

Figure 6.9(a) shows the geometry of a typical disk surface. Each surface
consists of a collection of concentric rings called tracks. Each track is partitioned
into a collection of sectors. Each sector contains an equal number of data bits
(typically 512 bytes) encoded in the magnetic material on the sector. Sectors are
separated by gaps where no data bits are stored. Gaps store formatting bits that
identify sectors.

Figure 6.9 Disk geometry. (a) Single-platter view.


A disk consists of one or more platters stacked on top of each other and
encased in a sealed package, as shown in Figure 6.9(b). The entire assembly is
often referred to as a disk drive, although we will usually refer to it as simply a
disk. We will sometimes refer to disks as rotating disks to distinguish them from
flash-based solid state disks (SSDs), which have no moving parts.

Disk manufacturers describe the geometry of multiple-platter drives in terms
of cylinders, where a cylinder is the collection of tracks on all the surfaces that are
equidistant from the center of the spindle. For example, if a drive has three platters
and six surfaces, and the tracks on each surface are numbered consistently, then
cylinder k is the collection of the six instances of track k.


Disk Capacity 


The maximum number of bits that can be recorded by a disk is known as its max- 
imum capacity, or simply capacity. Disk capacity is determined by the following 
technology factors: 


Recording density (bits/in). The number of bits that can be squeezed into a
    1-inch segment of a track.


Track density (tracks/in). The number of tracks that can be squeezed into a
    1-inch segment of the radius extending from the center of the platter.


Areal density (bits/in$^2$). The product of the recording density and the track
    density.


Disk manufacturers work tirelessly to increase areal density (and thus capacity),
and this is doubling every couple of years. The original disks, designed in
an age of low areal density, partitioned every track into the same number of
sectors, which was determined by the number of sectors that could be recorded on
the innermost track. To maintain a fixed number of sectors per track, the sectors
were spaced farther apart on the outer tracks. This was a reasonable approach
when areal densities were relatively low. However, as areal densities increased,
the gaps between sectors (where no data bits were stored) became unacceptably
large. Thus, modern high-capacity disks use a technique known as multiple zone
recording, where the set of cylinders is partitioned into disjoint subsets known as
recording zones. Each zone consists of a contiguous collection of cylinders. Each
track in each cylinder in a zone has the same number of sectors, which is
determined by the number of sectors that can be packed into the innermost track of
the zone.

Aside  How much is a gigabyte?

Unfortunately, the meanings of prefixes such as kilo (K), mega (M), giga (G), and tera (T) depend
on the context. For measures that relate to the capacity of DRAMs and SRAMs, typically $K = 2^{10}$,
$M = 2^{20}$, $G = 2^{30}$, and $T = 2^{40}$. For measures related to the capacity of I/O devices such as disks and
networks, typically $K = 10^3$, $M = 10^6$, $G = 10^9$, and $T = 10^{12}$. Rates and throughputs usually use these
prefix values as well.
Fortunately, for the back-of-the-envelope estimates that we typically rely on, either assumption
works fine in practice. For example, the relative difference between $2^{30}$ and $10^9$ is not that large:
$(2^{30} - 10^9)/10^9 \approx 7\%$. Similarly, $(2^{40} - 10^{12})/10^{12} \approx 10\%$.

The capacity of a disk is given by the following formula:

$$\text{Capacity} = \frac{\#\ \text{bytes}}{\text{sector}} \times \frac{\text{average}\ \#\ \text{sectors}}{\text{track}} \times \frac{\#\ \text{tracks}}{\text{surface}} \times \frac{\#\ \text{surfaces}}{\text{platter}} \times \frac{\#\ \text{platters}}{\text{disk}}$$

For example, suppose we have a disk with five platters, 512 bytes per sector, 20,000
tracks per surface, and an average of 300 sectors per track. Then the capacity of
the disk is

$$\text{Capacity} = \frac{512\ \text{bytes}}{\text{sector}} \times \frac{300\ \text{sectors}}{\text{track}} \times \frac{20{,}000\ \text{tracks}}{\text{surface}} \times \frac{2\ \text{surfaces}}{\text{platter}} \times \frac{5\ \text{platters}}{\text{disk}}$$
$$= 30{,}720{,}000{,}000\ \text{bytes}$$
$$= 30.72\ \text{GB}$$

Notice that manufacturers express disk capacity in units of gigabytes (GB) or
terabytes (TB), where 1 GB = $10^9$ bytes and 1 TB = $10^{12}$ bytes.

Practice Problem 6.2
What is the capacity of a disk with 2 platters, 10,000 cylinders, an average of 400
sectors per track, and 512 bytes per sector?
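
The capacity formula translates directly into code. The sketch below is one way to write it; the function and parameter names are hypothetical, and the conversion uses the manufacturers' convention from the text that 1 GB is $10^9$ bytes.

```c
#include <assert.h>

/* Capacity in bytes: the product of the five per-unit counts in the
 * capacity formula. */
unsigned long long disk_capacity(unsigned long long bytes_per_sector,
                                 unsigned long long avg_sectors_per_track,
                                 unsigned long long tracks_per_surface,
                                 unsigned long long surfaces_per_platter,
                                 unsigned long long platters_per_disk)
{
    assert(bytes_per_sector > 0);
    return bytes_per_sector * avg_sectors_per_track * tracks_per_surface
         * surfaces_per_platter * platters_per_disk;
}

/* Convert bytes to gigabytes using the disk convention 1 GB = 10^9 bytes. */
double bytes_to_gb(unsigned long long bytes)
{
    return (double)bytes / 1e9;
}
```

Calling disk_capacity(512, 300, 20000, 2, 5) reproduces the worked example: 30,720,000,000 bytes, or 30.72 GB. Note that the number of tracks per surface equals the number of cylinders, since each cylinder contributes one track to every surface.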












Disk Operation 


Disks read and write bits stored on the magnetic surface using a read/write head
connected to the end of an actuator arm, as shown in Figure 6.10(a). By moving
the arm back and forth along its radial axis, the drive can position the head over
any track on the surface. This mechanical motion is known as a seek. Once the
head is positioned over the desired track, then, as each bit on the track passes
underneath, the head can either sense the value of the bit (read the bit) or alter
the value of the bit (write the bit). Disks with multiple platters have a separate
read/write head for each surface, as shown in Figure 6.10(b). The heads are lined
up vertically and move in unison. At any point in time, all heads are positioned
on the same cylinder.

Figure 6.10 Disk dynamics. (a) Single-platter view. (b) Multiple-platter view.
(Labels: spindle; read/write heads; by moving radially, the arm can position the
read/write head over any track.)

The read/write head at the end of the arm flies (literally) on a thin cushion of
air over the disk surface at a height of about 0.1 microns and a speed of about 80
km/h. This is analogous to placing a skyscraper on its side and flying it around the
world at a height of 2.5 cm (1 inch) above the ground, with each orbit of the earth
taking only 8 seconds! At these tolerances, a tiny piece of dust on the surface is like
a huge boulder. If the head were to strike one of these boulders, the head would
cease flying and crash into the surface (a so-called head crash). For this reason,
disks are always sealed in airtight packages.

Disks read and write data in sector-size blocks. The access time for a sector 
has three main components: seek time, rotational latency, and transfer time: 


Seek time. To read the contents of some target sector, the arm first positions the 
head over the track that contains the target sector. The time required to 
move the arm is called the seek time. The seek time, $T_{\text{seek}}$, depends on the 
previous position of the head and the speed that the arm moves across the 
surface. The average seek time in modern drives, $T_{\text{avg seek}}$, measured by 
taking the mean of several thousand seeks to random sectors, is typically 
on the order of 3 to 9 ms. The maximum time for a single seek, $T_{\text{max seek}}$, 
can be as high as 20 ms. 

Rotational latency. Once the head is in position over the track, the drive waits 
for the first bit of the target sector to pass under the head. The performance 
of this step depends on both the position of the surface when the 
head arrives at the target track and the rotational speed of the disk. In the 
worst case, the head just misses the target sector and waits for the disk to 
make a full rotation. Thus, the maximum rotational latency, in seconds, is 
given by 

$$T_{\text{max rotation}} = \frac{1}{\text{RPM}} \times \frac{60\ \text{secs}}{1\ \text{min}}$$

The average rotational latency, $T_{\text{avg rotation}}$, is simply half of $T_{\text{max rotation}}$. 


Transfer time. When the first bit of the target sector is under the head, the drive 
can begin to read or write the contents of the sector. The transfer time 
for one sector depends on the rotational speed and the number of sectors 
per track. Thus, we can roughly estimate the average transfer time for one 
sector in seconds as 

$$T_{\text{avg transfer}} = \frac{1}{\text{RPM}} \times \frac{1}{(\text{average \# sectors/track})} \times \frac{60\ \text{secs}}{1\ \text{min}}$$

We can estimate the average time to access the contents of a disk sector as 
the sum of the average seek time, the average rotational latency, and the average 
transfer time. For example, consider a disk with the following parameters: 


Parameter                          Value 
Rotational rate                    7,200 RPM 
$T_{\text{avg seek}}$              9 ms 
Average number of sectors/track    400 

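For this disk, the average rotational latency (in ms) is 

$$
\begin{aligned}
T_{\text{avg rotation}} &= 1/2 \times T_{\text{max rotation}} \\
&= 1/2 \times (60\ \text{secs} / 7{,}200\ \text{RPM}) \times 1{,}000\ \text{ms/sec} \\
&\approx 4\ \text{ms}
\end{aligned}
$$

The average transfer time is 

$$
\begin{aligned}
T_{\text{avg transfer}} &= 60 / 7{,}200\ \text{RPM} \times 1/400\ \text{sectors/track} \times 1{,}000\ \text{ms/sec} \\
&\approx 0.02\ \text{ms}
\end{aligned}
$$

Putting it all together, the total estimated access time is 

$$
\begin{aligned}
T_{\text{access}} &= T_{\text{avg seek}} + T_{\text{avg rotation}} + T_{\text{avg transfer}} \\
&= 9\ \text{ms} + 4\ \text{ms} + 0.02\ \text{ms} \\
&= 13.02\ \text{ms}
\end{aligned}
$$

This example illustrates some important points: 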


* The time to access the 512 bytes in a disk sector is dominated by the seek time 
and the rotational latency. Accessing the first byte in the sector takes a long 
time, but the remaining bytes are essentially free. 


* Since the seek time and rotational latency are roughly the same, twice the 
seek time is a simple and reasonable rule for estimating disk access time. 






























* The access time for a 64-bit word stored in SRAM is roughly 4 ns, and 60 ns 
for DRAM. Thus, the time to read a 512-byte sector-size block from memory 
is roughly 256 ns for SRAM and 4,000 ns for DRAM. The disk access time, 
roughly 10 ms, is about 40,000 times greater than SRAM, and about 2,500 
times greater than DRAM. 
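
The access-time estimate worked through above can be sketched in C. This is a minimal illustration, not code from the text: the function names (avg_rotation_ms, avg_transfer_ms, access_time_ms) are made up for this example, and the formulas are simply the three components defined earlier, expressed in milliseconds.

```c
#include <assert.h>

/* Average rotational latency in ms: half of one full rotation. */
double avg_rotation_ms(double rpm)
{
    assert(rpm > 0);
    return 0.5 * (60.0 / rpm) * 1000.0;
}

/* Average time in ms for one sector to pass under the head. */
double avg_transfer_ms(double rpm, double sectors_per_track)
{
    assert(rpm > 0 && sectors_per_track > 0);
    return (60.0 / rpm) * (1.0 / sectors_per_track) * 1000.0;
}

/* Estimated access time in ms: seek + rotational latency + transfer. */
double access_time_ms(double avg_seek_ms, double rpm, double sectors_per_track)
{
    return avg_seek_ms + avg_rotation_ms(rpm)
         + avg_transfer_ms(rpm, sectors_per_track);
}
```

Calling access_time_ms(9.0, 7200, 400) for the example disk gives about 13.19 ms. The worked example arrives at 13.02 ms because it rounds the rotational latency to 4 ms and the transfer time to 0.02 ms before adding.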


Practice Problem 6.3 (solution page 661) 
Estimate the average time (in ms) to access a sector on the following disk: 


Parameter                          Value 
Rotational rate                    15,000 RPM 
$T_{\text{avg seek}}$              8 ms 
Average number of sectors/track    500 


Logical Disk Blocks 


As we have seen, modern disks have complex geometries, with multiple surfaces 
and different recording zones on those surfaces. To hide this complexity from 
the operating system, modern disks present a simpler view of their geometry as 
a sequence of B sector-size logical blocks, numbered 0, 1, ..., B - 1. A small 
hardware/firmware device in the disk package, called the disk controller, maintains 
the mapping between logical block numbers and actual (physical) disk sectors. 

When the operating system wants to perform an I/O operation such as reading 
a disk sector into main memory, it sends a command to the disk controller asking 
it to read a particular logical block number. Firmware on the controller performs 
a fast table lookup that translates the logical block number into a (surface, track, 
sector) triple that uniquely identifies the corresponding physical sector. Hardware 
on the controller interprets this triple to move the heads to the appropriate 
cylinder, waits for the sector to pass under the head, gathers up the bits sensed 
by the head into a small memory buffer on the controller, and copies them into 
main memory. 
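
To make the translation step concrete, here is a minimal sketch of mapping a logical block number to a (surface, track, sector) triple. It rests on assumptions that are not in the text: a hypothetical disk where every track holds the same number of sectors, and blocks that fill one cylinder (all surfaces) before moving to the next. Real controllers use a table lookup that accounts for recording zones and spare cylinders, so the struct and function names here are purely illustrative.

```c
#include <assert.h>

/* Hypothetical simplified geometry: every track holds the same number
   of sectors (real disks have multiple recording zones). */
struct geometry {
    unsigned long surfaces;
    unsigned long tracks_per_surface;
    unsigned long sectors_per_track;
};

struct location {
    unsigned long surface;
    unsigned long track;    /* also the cylinder index */
    unsigned long sector;
};

/* Translate a logical block number, assuming blocks fill a whole
   cylinder (surface by surface) before moving to the next track. */
struct location block_to_location(const struct geometry *g, unsigned long block)
{
    unsigned long per_cylinder = g->surfaces * g->sectors_per_track;
    struct location loc;

    assert(g->surfaces > 0 && g->sectors_per_track > 0);
    assert(block < per_cylinder * g->tracks_per_surface);

    loc.track = block / per_cylinder;
    loc.surface = (block % per_cylinder) / g->sectors_per_track;
    loc.sector = block % g->sectors_per_track;
    return loc;
}
```

Filling a cylinder before moving the arm is one plausible layout because all heads sit on the same cylinder at any point in time, so consecutive blocks can be read without a seek.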


Aside Formatted disk capacity 

Before a disk can be used to store data, it must be formatted by the disk controller. This involves filling 
in the gaps between sectors with information that identifies the sectors, identifying any cylinders with 
surface defects and taking them out of action, and setting aside a set of cylinders in each zone as spares 
that can be called into action if one or more cylinders in the zone goes bad during the lifetime of the 
disk. The formatted capacity quoted by disk manufacturers is less than the maximum capacity because 
of the existence of these spare cylinders. 


Practice Problem 6.4 (solution page 661) 

Suppose that a 1 MB file consisting of 512-byte logical blocks is stored on a disk 
drive with the following characteristics: 

Parameter                          Value 
Rotational rate                    10,000 RPM 
$T_{\text{avg seek}}$              5 ms 
Average number of sectors/track    1,000 
Surfaces                           4 
Sector size                        512 bytes 

For each case below, suppose that a program reads the logical blocks of the 
file sequentially, one after the other, and that the time to position the head over 
the first block is $T_{\text{avg seek}} + T_{\text{avg rotation}}$. 


A. Best case: Estimate the optimal time (in ms) required to read the file given 
the best possible mapping of logical blocks to disk sectors (i.e., sequential). 


B. Random case: Estimate the time (in ms) required to read the file if blocks 
are mapped randomly to disk sectors. 


Connecting I/O Devices 


Input/output (I/O) devices such as graphics cards, monitors, mice, keyboards, and 
disks are connected to the CPU and main memory using an I/O bus. Unlike the 
system bus and memory buses, which are CPU-specific, I/O buses are designed 
to be independent of the underlying CPU. Figure 6.11 shows a representative I/O 
bus structure that connects the CPU, main memory, and I/O devices. 

Although the I/O bus is slower than the system and memory buses, it can 
accommodate a wide variety of third-party I/O devices. For example, the bus in 
Figure 6.11 has three different types of devices attached to it. 


* A Universal Serial Bus (USB) controller is a conduit for devices attached to 
a USB bus, which is a wildly popular standard for connecting a variety of 
peripheral I/O devices, including keyboards, mice, modems, digital cameras, 
game controllers, printers, external disk drives, and solid state disks. USB 
3.0 buses have a maximum bandwidth of 625 MB/s. USB 3.1 buses have a 
maximum bandwidth of 1,250 MB/s. 

* A graphics card (or adapter) contains hardware and software logic that is 
responsible for painting the pixels on the display monitor on behalf of the CPU. 

* A host bus adapter connects one or more disks to the I/O bus using 
a communication protocol defined by a particular host bus interface. The 
two most popular such interfaces for disks are SCSI (pronounced "scuzzy") 
and SATA (pronounced "sat-uh"). SCSI disks are typically faster and more 
expensive than SATA drives. A SCSI host bus adapter (often called a SCSI 
controller) can support multiple disk drives, as opposed to SATA adapters, 
which can only support one drive. 

Figure 6.11 Example bus structure that connects the CPU, main memory, and I/O 
devices. 


Additional devices such as network adapters can be attached to the I/O bus by 
plugging the adapter into empty expansion slots on the motherboard that provide 
a direct electrical connection to the bus. 


Accessing Disks 


While a detailed description of how I/O devices work and how they are 
programmed is outside our scope here, we can give you a general idea. For example, 
Figure 6.12 summarizes the steps that take place when a CPU reads data from a 
disk. 

Aside Advances in I/O bus designs 

The I/O bus in Figure 6.11 is a simple abstraction that allows us to be concrete, without being tied too 
closely to the details of any specific system. It is based on the peripheral component interconnect (PCI) 
bus, which was popular until around 2010. In the PCI model, each device in the system shares the bus, 
and only one device at a time can access these wires. In modern systems, the shared PCI bus has been 
replaced by a PCI express (PCIe) bus, which is a set of high-speed serial, point-to-point links connected 
by switches, akin to the switched Ethernets that you will learn about in Chapter 11. A PCIe bus, with a 
maximum throughput of 16 GB/s, is an order of magnitude faster than a PCI bus, which has a maximum 
throughput of 533 MB/s. Except for measured I/O performance, the differences between the different 
bus designs are not visible to application programs, so we will use the simple shared bus abstraction 
throughout the text. 

The CPU issues commands to I/O devices using a technique called memory-mapped 
I/O (Figure 6.12(a)). In a system with memory-mapped I/O, a block of 
addresses in the address space is reserved for communicating with I/O devices. 
Each of these addresses is known as an I/O port. Each device is associated with 
(or mapped to) one or more ports when it is attached to the bus. 

As a simple example, suppose that the disk controller is mapped to port 0xa0. 
Then the CPU might initiate a disk read by executing three store instructions to 
address 0xa0: The first of these instructions sends a command word that tells the 
disk to initiate a read, along with other parameters such as whether to interrupt 
the CPU when the read is finished. (We will discuss interrupts in Section 8.1.) The 
second instruction indicates the logical block number that should be read. 
The third instruction indicates the main memory address where the contents of 
the disk sector should be stored. 
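
The three-store sequence just described might look like the following in C. This is only a sketch under stated assumptions: the command bit values (CMD_READ, CMD_INTERRUPT_ON_DONE) and the function name start_disk_read are hypothetical, since the text does not define a command encoding, and the port is passed in as a pointer so the code can be exercised against an ordinary variable rather than a real device address such as 0xa0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command-word bits; a real controller defines its own. */
#define CMD_READ              0x1u
#define CMD_INTERRUPT_ON_DONE 0x2u

/* Start a disk read by performing three stores to the same I/O port.
   The volatile qualifier keeps the compiler from merging or dropping
   the stores, since each one is meaningful to the device. */
void start_disk_read(volatile uint64_t *port, uint64_t block, uint64_t dest_addr)
{
    assert(port != 0);
    *port = CMD_READ | CMD_INTERRUPT_ON_DONE;  /* store 1: command word */
    *port = block;                             /* store 2: logical block number */
    *port = dest_addr;                         /* store 3: destination address */
}
```

On real hardware the pointer would be the fixed port address, and only the operating system would normally be permitted to perform these stores.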

After it issues the request, the CPU will typically do other work while the 
disk is performing the read. Recall that a 1 GHz processor with a 1 ns clock cycle 
can potentially execute 16 million instructions in the 16 ms it takes to read the 
disk. Simply waiting and doing nothing while the transfer is taking place would be 
enormously wasteful. 

After the disk controller receives the read command from the CPU, it 
translates the logical block number to a sector address, reads the contents of the sector, 
and transfers the contents directly to main memory, without any intervention from 
the CPU (Figure 6.12(b)). This process, whereby a device performs a read or write 
bus transaction on its own, without any involvement of the CPU, is known as direct 
memory access (DMA). The transfer of data is known as a DMA transfer. 

After the DMA transfer is complete and the contents of the disk sector are 
safely stored in main memory, the disk controller notifies the CPU by sending an 
interrupt signal to the CPU (Figure 6.12(c)). The basic idea is that an interrupt 
signals an external pin on the CPU chip. This causes the CPU to stop what it is 
currently working on and jump to an operating system routine. The routine records 
the fact that the I/O has finished and then returns control to the point where the 
CPU was interrupted. 





Figure 6.12 Reading a disk sector. 
(a) The CPU initiates a disk read by writing a command, logical block number, and 
destination memory address to the memory-mapped address associated with the disk. 
(b) The disk controller reads the sector and performs a DMA transfer into main memory. 


Aside Characteristics of a commercial disk drive 

Disk manufacturers publish a lot of useful high-level technical information on their Web sites. For 
example, the Seagate Web site contains the following information (and much more!) about one of 
their popular drives, the Barracuda 7400. (Seagate.com) 

Geometry characteristic   Value            Performance characteristic        Value 
Surface diameter          3.5 in           Rotational rate                   7,200 RPM 
Formatted capacity        3 TB             Average rotational latency        4.16 ms 
Platters                  3                Average seek time                 8.5 ms 
Surfaces                  6                Track-to-track seek time          1.0 ms 
Logical blocks            5,860,533,168    Average transfer rate             156 MB/s 
Logical block size        512 bytes        Maximum sustained transfer rate   210 MB/s 


Figure 6.13 Solid state disk (SSD). 


6.1.3 Solid State Disks 


A solid state disk (SSD) is a storage technology, based on flash memory 
(Section 6.1.1), that in some situations is an attractive alternative to the conventional 
rotating disk. Figure 6.13 shows the basic idea. An SSD package plugs into a 
standard disk slot on the I/O bus (typically USB or SATA) and behaves like any other 
disk, processing requests from the CPU to read and write logical disk blocks. An 
SSD package consists of one or more flash memory chips, which replace the 
mechanical drive in a conventional rotating disk, and a flash translation layer, which 
is a hardware/firmware device that plays the same role as a disk controller, 
translating requests for logical blocks into accesses of the underlying physical device. 

Figure 6.14 shows the performance characteristics of a typical SSD. Notice that 
reading from SSDs is faster than writing. The difference between random reading 
and writing performance is caused by a fundamental property of the underlying 
flash memory. As shown in Figure 6.13, a flash memory consists of a sequence of B 
blocks, where each block consists of P pages. Typically, pages are 512 bytes to 4 KB 
in size, and a block consists of 32 to 128 pages, with total block sizes ranging from 16 
KB to 512 KB. Data are read and written in units of pages. A page can be written 
only after the entire block to which it belongs has been erased (typically, this means 
that all bits in the block are set to 1). However, once a block is erased, each page 
in the block can be written once with no further erasing. A block wears out after 
roughly 100,000 repeated writes. Once a block wears out, it can no longer be used. 

Reads                                          Writes 
Sequential read throughput        550 MB/s     Sequential write throughput        470 MB/s 
Random read throughput (IOPS)     89,000 IOPS  Random write throughput (IOPS)     74,000 IOPS 
Random read throughput (MB/s)     365 MB/s     Random write throughput (MB/s)     303 MB/s 
Avg. sequential read access time  50 μs        Avg. sequential write access time  60 μs 

Figure 6.14 Performance characteristics of a commercial solid state disk. Source: Intel SSD 730 product 
specification [53]. IOPS is I/O operations per second. Throughput numbers are based on reads and writes of 
4 KB blocks. (Intel SSD 730 product specification. Intel Corporation.) 

Random writes are slower for two reasons. First, erasing a block takes a 
relatively long time, on the order of 1 ms, which is more than an order of magnitude 
longer than it takes to access a page. Second, if a write operation attempts to 
modify a page p that contains existing data (i.e., not all ones), then any pages in 
the same block with useful data must be copied to a new (erased) block before 
the write to page p can occur. Manufacturers have developed sophisticated logic 
in the flash translation layer that attempts to amortize the high cost of erasing 
blocks and to minimize the number of internal copies on writes, but it is unlikely 
that random writing will ever perform as well as reading. 
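
The erase-before-write rule and the copy step described above can be illustrated with a toy model of a flash block. Everything here is an assumption made for illustration: the tiny PAGES_PER_BLOCK and PAGE_SIZE values, the struct layout, and the function names are hypothetical, and a real flash translation layer is far more sophisticated (it batches erasures, tracks wear, and remaps blocks).

```c
#include <assert.h>
#include <string.h>

#define PAGES_PER_BLOCK 4   /* hypothetical; the text cites 32 to 128 */
#define PAGE_SIZE       8   /* hypothetical; the text cites 512 bytes to 4 KB */

struct flash_block {
    unsigned char data[PAGES_PER_BLOCK][PAGE_SIZE];
    int written[PAGES_PER_BLOCK];  /* 1 if page was written since last erase */
    unsigned long erase_count;     /* erasures are what wear a block out */
};

/* Erase the whole block: every bit becomes 1 and all pages are writable. */
void flash_erase(struct flash_block *b)
{
    memset(b->data, 0xff, sizeof b->data);
    memset(b->written, 0, sizeof b->written);
    b->erase_count++;
}

/* Write one page. Returns 0 on success, or -1 if the page already holds
   data and therefore cannot be written until its block is erased. */
int flash_write_page(struct flash_block *b, int page, const unsigned char *src)
{
    assert(page >= 0 && page < PAGES_PER_BLOCK);
    if (b->written[page])
        return -1;
    memcpy(b->data[page], src, PAGE_SIZE);
    b->written[page] = 1;
    return 0;
}

/* Modify a page that already holds data: erase a fresh block, copy the
   other pages that hold useful data, then write the new contents. */
void flash_rewrite_page(const struct flash_block *old, struct flash_block *fresh,
                        int page, const unsigned char *src)
{
    int p;

    assert(page >= 0 && page < PAGES_PER_BLOCK);
    flash_erase(fresh);
    for (p = 0; p < PAGES_PER_BLOCK; p++) {
        if (p == page)
            flash_write_page(fresh, p, src);
        else if (old->written[p])
            flash_write_page(fresh, p, old->data[p]);
    }
}
```

In this model, changing a single page costs one block erase plus a copy of every other live page in the block, which is the overhead that makes random writes slower than reads.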

SSDs have a number of advantages over rotating disks. They are built of 
semiconductor memory, with no moving parts, and thus have much faster random 
access times than rotating disks, use less power, and are more rugged. However, 
there are some disadvantages. First, because flash blocks wear out after repeated 
writes, SSDs have the potential to wear out as well. Wear-leveling logic in the flash 
translation layer attempts to maximize the lifetime of each block by spreading 
erasures evenly across all blocks. In practice, the wear-leveling logic is so good 
that it takes many years for SSDs to wear out (see Practice Problem 6.5). Second, 
SSDs are about 30 times more expensive per byte than rotating disks, and thus the 
typical storage capacities are significantly less than rotating disks. However, SSD 
prices are decreasing rapidly as they become more popular, and the gap between 
the two is decreasing. 

SSDs have completely replaced rotating disks in portable music devices, are 
popular as disk replacements in laptops, and have even begun to appear in 
desktops and servers. While rotating disks are here to stay, it is clear that SSDs are an 
important alternative. 






Practice Problem 6.5 

As we have seen, a potential drawback of SSDs is that the underlying flash memory 
can wear out. For example, for the SSD in Figure 6.14, Intel guarantees about 
128 petabytes (128 x 10^15 bytes) of writes before the drive wears out. Given 
this assumption, estimate the lifetime (in years) of this SSD for the following 
workloads: 
A. Worst case for sequential writes: The SSD is written to continuously at a rate 
of 470 MB/s (the average sequential write throughput of the device). 
B. Worst case for random writes: The SSD is written to continuously at a rate 
of 303 MB/s (the average random write throughput of the device). 
C. Average case: The SSD is written to at a rate of 20 GB/day (the average 
daily write rate assumed by some computer manufacturers in their mobile 
computer workload simulations). 





6.1.4 Storage Technology Trends 


There are several important concepts to take away from our discussion of storage 
technologies. 

Different storage technologies have different price and performance trade-offs. 
SRAM is somewhat faster than DRAM, and DRAM is much faster than disk. On 
the other hand, fast storage is always more expensive than slower storage. SRAM 
costs more per byte than DRAM. DRAM costs much more than disk. SSDs split 
the difference between DRAM and rotating disk. 

The price and performance properties of different storage technologies are 
changing at dramatically different rates. Figure 6.15 summarizes the price and per- 
formance properties of storage technologies since 1985, shortly after the first PCs 
were introduced. The numbers were culled from back issues of trade magazines 
and the Web. Although they were collected in an informal survey, the numbers 
reveal some interesting trends. 

Since 1985, both the cost and performance of SRAM technology have 
improved at roughly the same rate. Access times and cost per megabyte have 
decreased by a factor of about 100 (Figure 6.15(a)). However, the trends for DRAM 
and disk are much more dramatic and divergent. While the cost per megabyte of 
DRAM has decreased by a factor of 44,000 (more than four orders of magnitude!), 
DRAM access times have decreased by only a factor of 10 (Figure 6.15(b)). Disk 
technology has followed the same trend as DRAM and in even more dramatic 
fashion. While the cost of a megabyte of disk storage has plummeted by a factor 
of more than 3,000,000 (more than six orders of magnitude!) since 1985, access 
times have improved much more slowly, by only a factor of 25 (Figure 6.15(c)). 
These startling long-term trends highlight a basic truth of memory and disk 
technology: it is much easier to increase density (and thereby reduce cost) than to 
decrease access time. 

DRAM and disk performance are lagging behind CPU performance. As we see 
in Figure 6.15(d), CPU cycle times improved by a factor of 500 between 1985 and 
2010. If we look at the effective cycle time, which we define to be the cycle time 
of an individual CPU (processor) divided by the number of its processor cores, 
then the improvement between 1985 and 2010 is even greater, a factor of 2,000. 

The split in the CPU performance curve around 2003 reflects the introduction 
of multi-core processors (see aside on page 605). After this split, cycle times of 
individual cores actually increased a bit before starting to decrease again, albeit 
at a slower rate than before. 

Note that while SRAM performance lags, it is roughly keeping up. However, 
the gap between DRAM and disk performance and CPU performance is actually 
widening. Until the advent of multi-core processors around 2003, this performance 
gap was a function of latency, with DRAM and disk access times decreasing 
more slowly than the cycle time of an individual processor. However, with the 
introduction of multiple cores, this performance gap is increasingly a function of 
throughput, with multiple processor cores issuing requests to the DRAM and disk 
in parallel. 

Metric             1985     1990    1995   2000   2005    2010    2015     2015:1985 
$/MB               2,900    320     256    100    75      60      25       116 
Access (ns)        150      35      15     3      2       1.5     1.3      115 
(a) SRAM trends 

Metric             1985     1990    1995   2000   2005    2010    2015     2015:1985 
$/MB               880      100     30     1      0.1     0.06    0.02     44,000 
Access (ns)        200      100     70     60     50      40      20       10 
Typical size (MB)  0.256    4       16     64     2,000   8,000   16,000   62,500 
(b) DRAM trends 

Metric               1985      1990    1995   2000   2005   2010    2015    2015:1985 
$/GB                 100,000   8,000   300    10     5      0.3     0.03    3,333,333 
Min. seek time (ms)  75        28      10     8      5      3       3       25 
Typical size (GB)    0.01      0.16    1      20     160    1,500   3,000   300,000 
(c) Rotating disk trends 

Metric                     1985    1990    1995    2000    2003     2005     2010         2015         2015:1985 
Intel CPU                  80286   80386   Pent.   P-III   Pent. 4  Core 2   Core i7 (n)  Core i7 (h)  - 
Clock rate (MHz)           6       20      150     600     3,300    2,000    2,500        3,000        500 
Cycle time (ns)            166     50      6       1.6     0.3      0.5      0.4          0.33         500 
Cores                      1       1       1       1       1        2        4            4            4 
Effective cycle time (ns)  166     50      6       1.6     0.30     0.25     0.10         0.08         2,075 
(d) CPU trends 

Figure 6.15 Storage and processing technology trends. The Core i7 circa 2010 uses the Nehalem processor 
core. The Core i7 circa 2015 uses the Haswell core. 

Figure 6.16 The gap between disk, DRAM, and CPU speeds. 

The various trends are shown quite clearly in Figure 6.16, which plots the 
access and cycle times from Figure 6.15 on a semi-log scale. 

As we will see in Section 6.4, modern computers make heavy use of SRAM- 
based caches to try to bridge the processor-memory gap. This approach works 
because of a fundamental property of application programs known as locality, 
which we discuss next. 


Practice Problem 6.6 

Using the data from the years 2005 to 2015 in Figure 6.15(c), estimate the year 
when you will be able to buy a petabyte (10^15 bytes) of rotating disk storage for 
$500. Assume actual dollars (no inflation). 


6.2 Locality 


Well-written computer programs tend to exhibit good locality. That is, they tend 
to reference data items that are near other recently referenced data items or 
that were recently referenced themselves. This tendency, known as the principle 
of locality, is an enduring concept that has enormous impact on the design and 
performance of hardware and software systems. 

Locality is typically described as having two distinct forms: temporal locality 
and spatial locality. In a program with good temporal locality, a memory location 
that is referenced once is likely to be referenced again multiple times in the near 
future. In a program with good spatial locality, if a memory location is referenced 
once, then the program is likely to reference a nearby memory location in the near 
future. 

Aside When cycle time stood still: The advent of multi-core processors 

The history of computers is marked by some singular events that caused profound changes in the 
industry and the world. Interestingly, these inflection points tend to occur about once per decade: the 
development of Fortran in the 1950s, the introduction of the IBM 360 in the early 1960s, the dawn of 
the Internet (then called ARPANET) in the early 1970s, the introduction of the IBM PC in the early 
1980s, and the creation of the World Wide Web in the early 1990s. 

The most recent such event occurred early in the 21st century, when computer manufacturers 
ran headlong into the so-called power wall, discovering that they could no longer increase CPU clock 
frequencies as quickly because the chips would then consume too much power. The solution was to 
improve performance by replacing a single large processor with multiple smaller processor cores, each 
a complete processor capable of executing programs independently and in parallel with the other cores. 
This multi-core approach works in part because the power consumed by a processor is proportional to 
$P = fCV^2$, where f is the clock frequency, C is the capacitance, and V is the voltage. The capacitance 
C is roughly proportional to the area, so the power drawn by multiple cores can be held constant as long 
as the total area of the cores is constant. As long as feature sizes continue to shrink at the exponential 
Moore's Law rate, the number of cores in each processor, and thus its effective performance, will 
continue to increase. 

From this point forward, computers will get faster not because the clock frequency increases but 
because the number of cores in each processor increases, and because architectural innovations increase 
the efficiency of programs running on those cores. We can see this trend clearly in Figure 6.16. CPU 
cycle time reached its lowest point in 2003 and then actually started to rise before leveling off and 
starting to decline again at a slower rate than before. However, because of the advent of multi-core 
processors (dual-core in 2004 and quad-core in 2007), the effective cycle time continues to decrease at 
close to its previous rate. 

Programmers should understand the principle of locality because, in general, 
programs with good locality run faster than programs with poor locality. All levels 
of modern computer systems, from the hardware, to the operating system, to 
application programs, are designed to exploit locality. At the hardware level, the 
principle of locality allows computer designers to speed up main memory accesses 
by introducing small fast memories known as cache memories that hold blocks of 
the most recently referenced instructions and data items. At the operating system 
level, the principle of locality allows the system to use the main memory as a cache 
of the most recently referenced chunks of the virtual address space. Similarly, the 
operating system uses main memory to cache the most recently used disk blocks in 
the disk file system. The principle of locality also plays a crucial role in the design 
of application programs. For example, Web browsers exploit temporal locality by 
caching recently referenced documents on a local disk. High-volume Web servers 
hold recently requested documents in front-end disk caches that satisfy requests 
for these documents without requiring any intervention from the server. 

    int sumvec(int v[N]) 
    { 
        int i, sum = 0; 

        for (i = 0; i < N; i++) 
            sum += v[i]; 
        return sum; 
    } 

(a) 

    Address        0    4    8    12   16   20   24   28 
    Contents       v0   v1   v2   v3   v4   v5   v6   v7 
    Access order   1    2    3    4    5    6    7    8 

(b) 

Figure 6.17 (a) A function with good locality. (b) Reference pattern for vector v (N = 8). Notice how 
the vector elements are accessed in the same order that they are stored in memory. 


6.2.1 Locality of References to Program Data 


Consider the simple function in Figure 6.17(a) that sums the elements of a vector. 
Does this function have good locality? To answer this question, we look at the 
reference pattern for each variable. In this example, the sum variable is referenced 
once in each loop iteration, and thus there is good temporal locality with respect 
to sum. On the other hand, since sum is a scalar, there is no spatial locality with 
respect to sum. 

As we see in Figure 6.17(b), the elements of vector v are read sequentially, one 
after the other, in the order they are stored in memory (we assume for convenience 
that the array starts at address 0). Thus, with respect to variable v, the function 
has good spatial locality but poor temporal locality since each vector element 
is accessed exactly once. Since the function has either good spatial or temporal 
locality with respect to each variable in the loop body, we can conclude that the 
sumvec function enjoys good locality. 

A function such as sumvec that visits each element of a vector sequentially 
is said to have a stride-1 reference pattern (with respect to the element size). 
We will sometimes refer to stride-1 reference patterns as sequential reference 
patterns. Visiting every kth element of a contiguous vector is called a stride-k 
reference pattern. Stride-1 reference patterns are a common and important source 
of spatial locality in programs. In general, as the stride increases, the spatial locality 
decreases. 
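
As a small illustration of the definition, the following sketch generalizes sumvec to a stride-k reference pattern. The function name sum_stride and its parameters are hypothetical and chosen only for this example; with k equal to 1 it visits the same elements in the same order as sumvec.

```c
#include <assert.h>
#include <stddef.h>

/* Sum every kth element of v[0..n-1], starting at v[0].
   Consecutive references are k elements (k * sizeof(int) bytes) apart,
   which is a stride-k reference pattern. */
long sum_stride(const int *v, size_t n, size_t k)
{
    size_t i;
    long sum = 0;

    assert(k > 0);
    for (i = 0; i < n; i += k)
        sum += v[i];
    return sum;
}
```

For an 8-element vector, k = 1 touches all eight elements at consecutive addresses, while k = 4 touches only v[0] and v[4], which lie 16 bytes apart when an int is 4 bytes.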

Stride is also an important issue for programs that reference multidimensional 
arrays. For example, consider the sumarrayrows function in Figure 6.18(a) that 
sums the elements of a two-dimensional array. 

The doubly nested loop reads the elements of the array in row-major order. 
That is, the inner loop reads the elements of the first row, then the second row, 
and so on. The sumarrayrows function enjoys good spatial locality because it 
references the array in the same row-major order that the array is stored 
(Figure 6.18(b)). The result is a nice stride-1 reference pattern with excellent spatial 
locality. 

    int sumarrayrows(int a[M][N]) 
    { 
        int i, j, sum = 0; 

        for (i = 0; i < M; i++) 
            for (j = 0; j < N; j++) 
                sum += a[i][j]; 
        return sum; 
    } 

(a) 

    Address        0     4     8     12    16    20 
    Contents       a00   a01   a02   a10   a11   a12 
    Access order   1     2     3     4     5     6 

(b) 

Figure 6.18 (a) Another function with good locality. (b) Reference pattern for array a (M = 2, N = 3). 
There is good spatial locality because the array is accessed in the same row-major order in which it is stored 
in memory. 


    int sumarraycols(int a[M][N]) 
    { 
        int i, j, sum = 0; 

        for (j = 0; j < N; j++) 
            for (i = 0; i < M; i++) 
                sum += a[i][j]; 
        return sum; 
    } 

(a) 

    Address        0     4     8     12    16    20 
    Contents       a00   a01   a02   a10   a11   a12 
    Access order   1     3     5     2     4     6 

(b) 

Figure 6.19 (a) A function with poor spatial locality. (b) Reference pattern for array a (M = 2, N = 3). 
The function has poor spatial locality because it scans memory with a stride-N reference pattern. 


Seemingly trivial changes to a program can have a big impact on its locality. 
For example, the sumarraycols function in Figure 6.19(a) computes the same 
result as the sumarrayrows function in Figure 6.18(a). The only difference is that 
we have interchanged the i and j loops. What impact does interchanging the loops 
have on its locality? 

The sumarraycols function suffers from poor spatial locality because it scans 
the array column-wise instead of row-wise. Since C arrays are laid out in memory 
row-wise, the result is a stride-N reference pattern, as shown in Figure 6.19(b). 
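
One way to see the two reference patterns side by side is to record the byte offset of each element in the order each loop nest touches it. The sketch below is illustrative only: the helper names offsets_rowwise and offsets_colwise are hypothetical, and M and N are fixed at the small values used in the figures.

```c
#include <assert.h>
#include <stddef.h>

#define M 2
#define N 3

/* Record, in visiting order, the byte offset of each element touched by
   a row-wise scan (the loop order of sumarrayrows). */
void offsets_rowwise(int a[M][N], ptrdiff_t out[M * N])
{
    int i, j, n = 0;

    assert(a != NULL && out != NULL);
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            out[n++] = (char *)&a[i][j] - (char *)a;
}

/* Same, for a column-wise scan (the loop order of sumarraycols). */
void offsets_colwise(int a[M][N], ptrdiff_t out[M * N])
{
    int i, j, n = 0;

    assert(a != NULL && out != NULL);
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            out[n++] = (char *)&a[i][j] - (char *)a;
}
```

On a machine with 4-byte ints, the row-wise scan visits offsets 0, 4, 8, 12, 16, 20, while the column-wise scan visits 0, 12, 4, 16, 8, 20. The jump of N elements between consecutive references within a column is the stride-N pattern.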


6.2.2 Locality of Instruction Fetches 


Since program instructions are stored in memory and must be fetched (read) 
by the CPU, we can also evaluate the locality of a program with respect to its 
instruction fetches. For example, in Figure 6.17 the instructions in the body of the 
for loop are executed in sequential memory order, and thus the loop enjoys good 
spatial locality. Since the loop body is executed multiple times, it also enjoys good 
temporal locality. 

An important property of code that distinguishes it from program data is 
that it is rarely modified at run time. While a program is executing, the CPU 
reads its instructions from memory. The CPU rarely overwrites or modifies these 
instructions. 


6.2.3 Summary of Locality 





In this section, we have introduced the fundamental idea of locality and have
identified some simple rules for qualitatively evaluating the locality in a program:


* Programs that repeatedly reference the same variables enjoy good temporal
  locality.

* For programs with stride-k reference patterns, the smaller the stride, the
  better the spatial locality. Programs with stride-1 reference patterns have good
  spatial locality. Programs that hop around memory with large strides have
  poor spatial locality.

* Loops have good temporal and spatial locality with respect to instruction
  fetches. The smaller the loop body and the greater the number of loop
  iterations, the better the locality.





Later in this chapter, after we have learned about cache memories and how
they work, we will show you how to quantify the idea of locality in terms of cache
hits and misses. It will also become clear to you why programs with good locality
typically run faster than programs with poor locality. Nonetheless, knowing how to
glance at source code and get a high-level feel for the locality in the program
is a useful and important skill for a programmer to master.


Permute the loops in the following function so that it scans the three-dimensional
array a with a stride-1 reference pattern.

    int sumarray3d(int a[N][N][N])
    {
        int i, j, k, sum = 0;

        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                for (k = 0; k < N; k++) {
                    sum += a[k][i][j];
                }
            }
        }
        return sum;
    }

(a) An array of structs

    #define N 1000

    typedef struct {
        int vel[3];
        int acc[3];
    } point;

    point p[N];

(b) The clear1 function

    void clear1(point *p, int n)
    {
        int i, j;

        for (i = 0; i < n; i++) {
            for (j = 0; j < 3; j++)
                p[i].vel[j] = 0;
            for (j = 0; j < 3; j++)
                p[i].acc[j] = 0;
        }
    }

(c) The clear2 function

    void clear2(point *p, int n)
    {
        int i, j;

        for (i = 0; i < n; i++) {
            for (j = 0; j < 3; j++) {
                p[i].vel[j] = 0;
                p[i].acc[j] = 0;
            }
        }
    }

(d) The clear3 function

    void clear3(point *p, int n)
    {
        int i, j;

        for (j = 0; j < 3; j++) {
            for (i = 0; i < n; i++)
                p[i].vel[j] = 0;
            for (i = 0; i < n; i++)
                p[i].acc[j] = 0;
        }
    }


Figure 6.20 Code examples for Practice Problem 6.8.





The three functions in Figure 6.20 perform the same operation with varying
degrees of spatial locality. Rank-order the functions with respect to the spatial
locality enjoyed by each. Explain how you arrived at your ranking.


6.3 The Memory Hierarchy 


Sections 6.1 and 6.2 described some fundamental and enduring properties of 
storage technology and computer software: 


Storage technology: Different storage technologies have widely different access
    times. Faster technologies cost more per byte than slower ones and have
    less capacity. The gap between CPU and main memory speed is widening.

Computer software: Well-written programs tend to exhibit good locality.






















    Level   Storage device                            Holds
    L0      Registers                                 Words retrieved from cache memory
    L1      L1 cache (SRAM)                           Cache lines retrieved from L2 cache
    L2      L2 cache (SRAM)                           Cache lines retrieved from L3 cache
    L3      L3 cache (SRAM)                           Cache lines retrieved from memory
    L4      Main memory (DRAM)                        Disk blocks retrieved from local disks
    L5      Local secondary storage (local disks)     Files retrieved from disks on remote network servers
    L6      Remote secondary storage (distributed file systems, Web servers)

    Toward L0: smaller, faster, and costlier (per byte) storage devices.
    Toward L6: larger, slower, and cheaper (per byte) storage devices.


Figure 6.21 The memory hierarchy.


In one of the happier coincidences of computing, these fundamental properties of 
hardware and software complement each other beautifully. Their complementary 
nature suggests an approach for organizing memory systems, known as the mem- 
ory hierarchy, that is used in all modern computer systems. Figure 6.21 shows a 
typical memory hierarchy. 

In general, the storage devices get slower, cheaper, and larger as we move
from higher to lower levels. At the highest level (L0) are a small number of fast
CPU registers that the CPU can access in a single clock cycle. Next are one or
more small to moderate-size SRAM-based cache memories that can be accessed
in a few CPU clock cycles. These are followed by a large DRAM-based main
memory that can be accessed in tens to hundreds of clock cycles. Next are slow
but enormous local disks. Finally, some systems even include an additional level
of disks on remote servers that can be accessed over a network. For example,
distributed file systems such as the Andrew File System (AFS) or the Network
File System (NFS) allow a program to access files that are stored on remote
network-connected servers. Similarly, the World Wide Web allows programs to
access remote files stored on Web servers anywhere in the world.


6.3.1 Caching in the Memory Hierarchy 


In general, a cache (pronounced “cash”) is a small, fast storage device that acts as
a staging area for the data objects stored in a larger, slower device. The process of
using a cache is known as caching (pronounced “cashing”).

The central idea of a memory hierarchy is that for each k, the faster and smaller
storage device at level k serves as a cache for the larger and slower storage device
at level k + 1. In other words, each level in the hierarchy caches data objects from
the next lower level. For example, the local disk serves as a cache for files (such
as Web pages) retrieved from remote disks over the network, the main memory
serves as a cache for data on the local disks, and so on, until we get to the smallest
cache of all, the set of CPU registers.

We have shown you one example of a memory hierarchy, but other combinations are
possible, and indeed common. For example, many sites, including Google datacenters,
back up local disks onto archival magnetic tapes. At some of these sites, human
operators manually mount the tapes onto tape drives as needed. At other sites, tape
robots handle this task automatically. In either case, the collection of tapes
represents a level in the memory hierarchy, below the local disk level, and the same
general principles apply. Tapes are cheaper per byte than disks, which allows sites
to archive multiple snapshots of their local disks. The trade-off is that tapes take
longer to access than disks. As another example, solid state disks are playing an
increasingly important role in the memory hierarchy, bridging the gulf between DRAM
and rotating disk.


    Level k:       Smaller, faster, more expensive device at level k caches a
                   subset of the blocks from level k + 1.

                   Data are copied between levels in block-size transfer units.

    Level k + 1:   Larger, slower, cheaper storage device at level k + 1 is
                   partitioned into blocks.


Figure 6.22 The basic principle of caching in a memory hierarchy.

Figure 6.22 shows the general concept of caching in a memory hierarchy. The
storage at level k + 1 is partitioned into contiguous chunks of data objects called
blocks. Each block has a unique address or name that distinguishes it from other
blocks. Blocks can be either fixed size (the usual case) or variable size (e.g., the
remote HTML files stored on Web servers). For example, the level k + 1 storage
in Figure 6.22 is partitioned into 16 fixed-size blocks, numbered 0 to 15.

Similarly, the storage at level k is partitioned into a smaller set of blocks that
are the same size as the blocks at level k + 1. At any point in time, the cache at
level k contains copies of a subset of the blocks from level k + 1. For example, in
Figure 6.22, the cache at level k has room for four blocks and currently contains
copies of blocks 4, 9, 14, and 3.

Data are always copied back and forth between level k and level k + 1 in
block-size transfer units. It is important to realize that while the block size is fixed
between any particular pair of adjacent levels in the hierarchy, other pairs of levels
can have different block sizes. For example, in Figure 6.21, transfers between L1
and L0 typically use word-size blocks. Transfers between L2 and L1 (and L3 and
L2, and L4 and L3) typically use blocks of tens of bytes. And transfers between L5
and L4 use blocks with hundreds or thousands of bytes. In general, devices lower
in the hierarchy (further from the CPU) have longer access times, and thus tend
to use larger block sizes in order to amortize these longer access times.


Cache Hits 


When a program needs a particular data object d from level k + 1, it first looks 
for d in one of the blocks currently stored at level k. If d happens to be cached 
at level k, then we have what is called a cache hit. The program reads d directly
from level k, which by the nature of the memory hierarchy is faster than reading 
d from level k + 1. For example, a program with good temporal locality might read 
a data object from block 14, resulting in a cache hit from level k. 


Cache Misses 


If, on the other hand, the data object d is not cached at level k, then we have what 
is called a cache miss. When there is a miss, the cache at level k fetches the block 
containing d from the cache at level k + 1, possibly overwriting an existing block 
if the level k cache is already full. 

This process of overwriting an existing block is known as replacing or evicting
the block. The block that is evicted is sometimes referred to as a victim block.
The decision about which block to replace is governed by the cache's replacement
policy. For example, a cache with a random replacement policy would choose a
random victim block. A cache with a least recently used (LRU) replacement policy
would choose the block that was last accessed the furthest in the past.

After the cache at level k has fetched the block from level k + 1, the program
can read d from level k as before. For example, in Figure 6.22, reading a data object
from block 12 in the level k cache would result in a cache miss because block 12 is
not currently stored in the level k cache. Once it has been copied from level k + 1
to level k, block 12 will remain there in expectation of later accesses.
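
As an illustration of the LRU policy described above, the following sketch picks a victim by comparing logical access times. The slot_t structure and the lru_victim function are hypothetical bookkeeping invented for this example; real hardware caches typically approximate LRU with far less state.

```c
#include <stddef.h>

/* Hypothetical bookkeeping for one block held at level k. */
typedef struct {
    int block;                /* number of the level k + 1 block held here */
    unsigned long last_used;  /* logical time of the most recent access */
} slot_t;

/* Return the index of the LRU victim: the slot whose most recent access
 * lies furthest in the past. Assumes n >= 1. */
size_t lru_victim(const slot_t *slots, size_t n)
{
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (slots[i].last_used < slots[victim].last_used)
            victim = i;
    return victim;
}
```

For a level k cache holding blocks 4, 9, 14, and 3 with (made-up) access times 7, 2, 9, and 5, the function selects the slot holding block 9, since it was touched least recently.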




Kinds of Cache Misses 


It is sometimes helpful to distinguish between different kinds of cache misses. If
the cache at level k is empty, then any access of any data object will miss. An
empty cache is sometimes referred to as a cold cache, and misses of this kind are
called compulsory misses or cold misses. Cold misses are important because they
are often transient events that might not occur in steady state, after the cache has
been warmed up by repeated memory accesses.

Whenever there is a miss, the cache at level k must implement some placement
policy that determines where to place the block it has retrieved from level k + 1.
The most flexible placement policy is to allow any block from level k + 1 to be
stored in any block at level k. For caches high in the memory hierarchy (close to
the CPU) that are implemented in hardware and where speed is at a premium,
this policy is usually too expensive to implement because randomly placed blocks
are expensive to locate.

Thus, hardware caches typically implement a simpler placement policy that
restricts a particular block at level k + 1 to a small subset (sometimes a singleton)
of the blocks at level k. For example, in Figure 6.22, we might decide that a block
i at level k + 1 must be placed in block (i mod 4) at level k. Under this policy, blocks
0, 4, 8, and 12 at level k + 1 would map to block 0 at level k; blocks 1, 5, 9, and
13 would map to block 1; and so on. Notice that our example cache in Figure 6.22
uses this policy.

Restrictive placement policies of this kind lead to a type of miss known as
a conflict miss, in which the cache is large enough to hold the referenced data
objects, but because they map to the same cache block, the cache keeps missing.
For example, in Figure 6.22, if the program requests block 0, then block 8, then
block 0, then block 8, and so on, each of the references to these two blocks would
miss in the cache at level k, even though this cache can hold a total of four blocks.
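
The conflict-miss scenario above can be checked with a short simulation of the (i mod 4) placement policy. This is a minimal sketch that assumes the level k cache starts out empty and that block numbers are nonnegative; the function name count_misses and the constant NUM_SLOTS are hypothetical.

```c
#include <stddef.h>

#define NUM_SLOTS 4   /* the level k cache in Figure 6.22 holds four blocks */

/* Count misses for a sequence of block requests when block i at level k + 1
 * may only be placed in slot (i mod 4) at level k. */
size_t count_misses(const int *requests, size_t n)
{
    int slot[NUM_SLOTS];
    size_t misses = 0;

    for (size_t s = 0; s < NUM_SLOTS; s++)
        slot[s] = -1;                      /* -1 marks an empty slot */

    for (size_t r = 0; r < n; r++) {
        int where = requests[r] % NUM_SLOTS;
        if (slot[where] != requests[r]) {  /* miss: fetch block, evict old one */
            slot[where] = requests[r];
            misses++;
        }
    }
    return misses;
}
```

For the request sequence 0, 8, 0, 8, 0, 8, every reference misses because blocks 0 and 8 both map to slot 0. By contrast, the sequence 0, 1, 0, 1, 0, 1 incurs only the two initial cold misses, since blocks 0 and 1 occupy different slots.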

Programs often run as a sequence of phases (e.g., loops) where each phase
accesses some reasonably constant set of cache blocks. For example, a nested loop
might access the elements of the same array over and over again. This set of blocks
is called the working set of the phase. When the size of the working set exceeds
the size of the cache, the cache will experience what are known as capacity misses.
In other words, the cache is just too small to handle this particular working set.


Cache Management 


As we have noted, the essence of the memory hierarchy is that the storage device 
at each level is a cache for the next lower level. At each level, some form of logic 
must manage the cache. By this we mean that something has to partition the cache 
storage into blocks, transfer blocks between different levels, decide when there are 
hits and misses, and then deal with them. The logic that manages the cache can be 
hardware, software, or a combination of the two. 

For example, the compiler manages the register file, the highest level of
the cache hierarchy. It decides when to issue loads when there are misses, and
determines which register to store the data in. The caches at levels L1, L2, and
L3 are managed entirely by hardware logic built into the caches. In a system 
with virtual memory, the DRAM main memory serves as a cache for data blocks 
stored on disk, and is managed by a combination of operating system software 
and address translation hardware on the CPU. For a machine with a distributed 
file system such as AFS, the local disk serves as a cache that is managed by the 
AFS client process running on the local machine. In most cases, caches operate 
automatically and do not require any specific or explicit actions from the program. 

    Type             What cached              Where cached            Latency (cycles)   Managed by
    CPU registers    4-byte or 8-byte words   On-chip CPU registers   0                  Compiler
    TLB              Address translations     On-chip TLB             0                  Hardware MMU
    L1 cache         64-byte blocks           On-chip L1 cache        4                  Hardware
    L2 cache         64-byte blocks           On-chip L2 cache        10                 Hardware
    L3 cache         64-byte blocks           On-chip L3 cache        50                 Hardware
    Virtual memory   4-KB pages               Main memory             200                Hardware + OS
    Buffer cache     Parts of files           Main memory             200                OS
    Disk cache       Disk sectors             Disk controller         100,000            Controller firmware
    Network cache    Parts of files           Local disk              10,000,000         NFS client
    Browser cache    Web pages                Local disk              10,000,000         Web browser
    Web cache        Web pages                Remote server disks     1,000,000,000      Web proxy server


Figure 6.23 The ubiquity of caching in modern computer systems. Acronyms: TLB: translation lookaside
buffer; MMU: memory management unit; OS: operating system; NFS: network file system.










6.3.2 Summary of Memory Hierarchy Concepts 


To summarize, memory hierarchies based on caching work because slower storage
is cheaper than faster storage and because programs tend to exhibit locality:


Exploiting temporal locality. Because of temporal locality, the same data objects
    are likely to be reused multiple times. Once a data object has been copied
    into the cache on the first miss, we can expect a number of subsequent
    hits on that object. Since the cache is faster than the storage at the next
    lower level, these subsequent hits can be served much faster than the
    original miss.

Exploiting spatial locality. Blocks usually contain multiple data objects. Because
    of spatial locality, we can expect that the cost of copying a block after a
    miss will be amortized by subsequent references to other objects within
    that block.














Caches are used everywhere in modern systems. As you can see from Figure 6.23,
caches are used in CPU chips, operating systems, distributed file systems,
and on the World Wide Web. They are built from and managed by various
combinations of hardware and software. Note that there are a number of terms and
acronyms in Figure 6.23 that we haven’t covered yet. We include them here to
demonstrate how common caches are.









6.4 Cache Memories 






The memory hierarchies of early computer systems consisted of only three levels:
CPU registers, main memory, and disk storage. However, because of the increasing
gap between CPU and main memory, system designers were compelled to insert
a small SRAM cache memory, called an L1 cache (level 1 cache) between the
CPU register file and main memory, as shown in Figure 6.24. The L1 cache can be
accessed nearly as fast as the registers, typically in about 4 clock cycles.

[Diagram labels: CPU chip, register file, cache memories, system bus, memory bus.]

Figure 6.24 Typical bus structure for cache memories.

As the performance gap between the CPU and main memory continued
to increase, system designers responded by inserting an additional larger cache,
called an L2 cache, between the L1 cache and main memory, that can be accessed
in about 10 clock cycles. Many modern systems include an even larger cache, called
an L3 cache, which sits between the L2 cache and main memory in the memory
hierarchy and can be accessed in about 50 cycles. While there is considerable
variety in the arrangements, the general principles are the same. For our discussion
in the next section, we will assume a simple memory hierarchy with a single L1
cache between the CPU and main memory.


6.4.1 Generic Cache Memory Organization 


Consider a computer system where each memory address has m bits that form
M = 2^m unique addresses. As illustrated in Figure 6.25(a), a cache for such a
machine is organized as an array of S = 2^s cache sets. Each set consists of E cache
lines. Each line consists of a data block of B = 2^b bytes, a valid bit that indicates
whether or not the line contains meaningful information, and t = m - (b + s) tag
bits (a subset of the bits from the current block's memory address) that uniquely
identify the block stored in the cache line.

In general, a cache's organization can be characterized by the tuple (S, E,
B, m). The size (or capacity) of a cache, C, is stated in terms of the aggregate size
of all the blocks. The tag bits and valid bit are not included. Thus, C = S × E × B.

When the CPU is instructed by a load instruction to read a word from address
A of main memory, it sends address A to the cache. If the cache is holding a copy 
of the word at address A, it sends the word immediately back to the CPU. So how 
does the cache know whether it contains a copy of the word at address A? The 
cache is organized so that it can find the requested word by simply inspecting the 
bits of the address, similar to a hash table with an extremely simple hash function. 
Here is how it works: 

The parameters S and B induce a partitioning of the m address bits into the
three fields shown in Figure 6.25(b). The s set index bits in A form an index into
the array of S sets. The first set is set 0, the second set is set 1, and so on. When
interpreted as an unsigned integer, the set index bits tell us which set the word
must be stored in. Once we know which set the word must be contained in, the t
tag bits in A tell us which line (if any) in the set contains the word. A line in the
set contains the word if and only if the valid bit is set and the tag bits in the line
match the tag bits in the address A. Once we have located the line identified by
the tag in the set identified by the set index, then the b block offset bits give us the
offset of the word in the B-byte data block.


    (a)  Each line:   1 valid bit, t tag bits, and a cache block of B = 2^b bytes
         Each set:    E lines
         Cache:       S = 2^s sets
         Cache size:  C = B × E × S data bytes

    (b)  Address (bit m - 1 down to bit 0):

         | Tag (t bits) | Set index (s bits) | Block offset (b bits) |


Figure 6.25 General organization of cache (S, E, B, m). (a) A cache is an array
of sets. Each set contains one or more lines. Each line contains a valid bit,
some tag bits, and a block of data. (b) The cache organization induces a
partition of the m address bits into t tag bits, s set index bits, and b block
offset bits.
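
The partitioning of an address into tag, set index, and block offset amounts to a few shifts and masks. The following is a minimal sketch under the assumption that addresses fit in 64 bits and that s + b < 64. The type addr_fields_t and the functions split_address and cache_size are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* The three fields of an address for a cache with 2^s sets and 2^b-byte blocks. */
typedef struct {
    uint64_t tag;
    uint64_t set_index;
    uint64_t block_offset;
} addr_fields_t;

/* Split addr into tag, set index, and block offset.
 * Assumes s + b < 64 so that every shift is well defined. */
addr_fields_t split_address(uint64_t addr, unsigned s, unsigned b)
{
    addr_fields_t f;
    f.block_offset = addr & ((UINT64_C(1) << b) - 1);      /* low b bits  */
    f.set_index = (addr >> b) & ((UINT64_C(1) << s) - 1);  /* next s bits */
    f.tag = addr >> (s + b);                               /* remaining t bits */
    return f;
}

/* Cache capacity in bytes: C = S * E * B, with S = 2^s and B = 2^b. */
uint64_t cache_size(unsigned s, unsigned E, unsigned b)
{
    return (UINT64_C(1) << s) * E * (UINT64_C(1) << b);
}
```

For example, with s = 2 and b = 1, address 13 (binary 1101) splits into tag 1, set index 2, and block offset 1, and a cache with these parameters and E = 1 holds 4 × 1 × 2 = 8 data bytes.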

As you may have noticed, descriptions of caches use a lot of symbols.
Figure 6.26 summarizes these symbols for your reference.





The following table gives the parameters for a number of different caches. For
each cache, determine the number of cache sets (S), tag bits (t), set index bits (s),
and block offset bits (b).

    Cache   m     C       B     E     S     t     s     b
    1.      32    1,024   4     1     ___   ___   ___   ___
    2.      32    1,024   8     4     ___   ___   ___   ___
    3.      32    1,024   32    32    ___   ___   ___   ___

    Parameter          Description

    Fundamental parameters
    S = 2^s            Number of sets
    E                  Number of lines per set
    B = 2^b            Block size (bytes)
    m = log2(M)        Number of physical (main memory) address bits

    Derived quantities
    M = 2^m            Maximum number of unique memory addresses
    s = log2(S)        Number of set index bits
    b = log2(B)        Number of block offset bits
    t = m - (s + b)    Number of tag bits
    C = B × E × S      Cache size (bytes), not including overhead such as the valid and tag bits


Figure 6.26 Summary of cache parameters.


[Diagram: an array of sets, each containing E = 1 line made up of a valid bit, a tag, and a cache block.]

Figure 6.27 Direct-mapped cache (E = 1). There is exactly one line per set.

6.4.2 Direct-Mapped Caches 


Caches are grouped into different classes based on E, the number of cache lines 
per set. A cache with exactly one line per set (E = 1) is known as a direct-mapped 
cache (see Figure 6.27). Direct-mapped caches are the simplest both to implement 
and to understand, so we will use them to illustrate some general concepts about 
how caches work. 

Suppose we have a system with a CPU, a register file, an L1 cache, and a main
memory. When the CPU executes an instruction that reads a memory word w,
it requests the word from the L1 cache. If the L1 cache has a cached copy of w,
then we have an L1 cache hit, and the cache quickly extracts w and returns it to
the CPU. Otherwise, we have a cache miss, and the CPU must wait while the L1
cache requests a copy of the block containing w from the main memory. When
the requested block finally arrives from memory, the L1 cache stores the block in
one of its cache lines, extracts word w from the stored block, and returns it to the
CPU. The process that a cache goes through of determining whether a request is a
hit or a miss and then extracting the requested word consists of three steps: (1) set
selection, (2) line matching, and (3) word extraction.

[Diagram: the s set index bits of the address select one set from the array of sets. Address fields: tag (t bits), set index (s bits), block offset (b bits).]

Figure 6.28 Set selection in a direct-mapped cache.


[Diagram annotations: (1) The valid bit must be set. (2) The tag bits in the cache line must match the tag bits in the address. (3) If (1) and (2), then cache hit, and block offset selects starting byte.]

Figure 6.29 Line matching and word selection in a direct-mapped cache. Within the
cache block, w0 denotes the low-order byte of the word w, w1 the next byte,
and so on.


Set Selection in Direct-Mapped Caches 


In this step, the cache extracts the s set index bits from the middle of the address 
for w. These bits are interpreted as an unsigned integer that corresponds to a set 
number. In other words, if we think of the cache as a one-dimensional array of 
sets, then the set index bits form an index into this array. Figure 6.28 shows how 
set selection works for a direct-mapped cache. In this example, the set index bits
00001 (binary) are interpreted as an integer index that selects set 1.


Line Matching in Direct-Mapped Caches 


Now that we have selected some set i in the previous step, the next step is to
determine if a copy of the word w is stored in one of the cache lines contained in
set i. In a direct-mapped cache, this is easy and fast because there is exactly one
line per set. A copy of w is contained in the line if and only if the valid bit is set
and the tag in the cache line matches the tag in the address of w.

Figure 6.29 shows how line matching works in a direct-mapped cache. In this
example, there is exactly one cache line in the selected set. The valid bit for this
line is set, so we know that the bits in the tag and block are meaningful. Since the
tag bits in the cache line match the tag bits in the address, we know that a copy of
the word we want is indeed stored in the line. In other words, we have a cache hit.
On the other hand, if either the valid bit were not set or the tags did not match,
then we would have had a cache miss.

Word Selection in Direct-Mapped Caches


Once we have a hit, we know that w is somewhere in the block. This last step 
determines where the desired word starts in the block. As shown in Figure 6.29, 
the block offset bits provide us with the offset of the first byte in the desired word. 
Similar to our view of a cache as an array of lines, we can think of a block as an 
array of bytes, and the byte offset as an index into that array. In the example, the
block offset bits of 100 (binary) indicate that the copy of w starts at byte 4 in the block.
(We are assuming that words are 4 bytes long.) 


Line Replacement on Misses in Direct-Mapped Caches 


If the cache misses, then it needs to retrieve the requested block from the next 
level in the memory hierarchy and store the new block in one of the cache lines of
the set indicated by the set index bits. In general, if the set is full of valid cache lines, 
then one of the existing lines must be evicted. For a direct-mapped cache, where 
each set contains exactly one line, the replacement policy is trivial: the current line 
is replaced by the newly fetched line. 


Putting It Together: A Direct-Mapped Cache in Action 


The mechanisms that a cache uses to select sets and identify lines are extremely 
simple. They have to be, because the hardware must perform them in a few 
nanoseconds. However, manipulating bits in this way can be confusing to us 
humans. A concrete example will help clarify the process. Suppose we have a 
direct-mapped cache described by 



    (S, E, B, m) = (4, 1, 2, 4)


In other words, the cache has four sets, one line per set, 2 bytes per block, and
4-bit addresses. We will also assume that each word is a single byte. Of course, these
assumptions are totally unrealistic, but they will help us keep the example simple.

When you are first learning about caches, it can be very instructive to enumer- 
ate the entire address space and partition the bits, as we've done in Figure 6.30 for 
our 4-bit example. There are some interesting things to notice about this enumer- 
ated space: 


* The concatenation of the tag and index bits uniquely identifies each block in 
memory. For example, block 0 consists of addresses 0 and 1, block 1 consists 
of addresses 2 and 3, block 2 consists of addresses 4 and 5, and so on. 


* Since there are eight memory blocks but only four cache sets, multiple blocks
map to the same cache set (i.e., they have the same set index). For example,
blocks 0 and 4 both map to set 0, blocks 1 and 5 both map to set 1, and so on.


* Blocks that map to the same cache set are uniquely identified by the tag. For
example, block 0 has a tag bit of 0 while block 4 has a tag bit of 1, block 1 has
a tag bit of 0 while block 5 has a tag bit of 1, and so on.

                            Address bits
    Address     Tag bits   Index bits   Offset bits   Block number
    (decimal)   (t = 1)    (s = 2)      (b = 1)       (decimal)
    0           0          00           0             0
    1           0          00           1             0
    2           0          01           0             1
    3           0          01           1             1
    4           0          10           0             2
    5           0          10           1             2
    6           0          11           0             3
    7           0          11           1             3
    8           1          00           0             4
    9           1          00           1             4
    10          1          01           0             5
    11          1          01           1             5
    12          1          10           0             6
    13          1          10           1             6
    14          1          11           0             7
    15          1          11           1             7


Figure 6.30 4-bit address space for example direct-mapped cache. 


Let us simulate the cache in action as the CPU performs a sequence of reads.
Remember that for this example we are assuming that the CPU reads 1-byte
words. While this kind of manual simulation is tedious and you may be tempted
to skip it, in our experience students do not really understand how caches work
until they work their way through a few of them.

Initially, the cache is empty (i.e., each valid bit is 0):


    Set   Valid   Tag   block[0]   block[1]
    0     0
    1     0
    2     0
    3     0

Each row in the table represents a cache line. The first column indicates the set 
that the line belongs to, but keep in mind that this is provided for convenience and 
is not really part of the cache. The next four columns represent the actual bits in 
each cache line. Now, let’s see what happens when the CPU performs a sequence 
of reads:


1. Read word at address 0. Since the valid bit for set 0 is 0, this is a cache miss.
   The cache fetches block 0 from memory (or a lower-level cache) and stores the
   block in set 0. Then the cache returns m[0] (the contents of memory location
   0) from block[0] of the newly fetched cache line.


    Set   Valid   Tag   block[0]   block[1]
    0     1       0     m[0]       m[1]
    1     0
    2     0
    3     0

2. Read word at address 1. This is a cache hit. The cache immediately returns 
m[1] from block[1] of the cache line. The state of the cache does not change. 


3. Read word at address 13. Since the cache line in set 2 is not valid, this is a
   cache miss. The cache loads block 6 into set 2 and returns m[13] from block[1]
   of the new cache line.


    Set   Valid   Tag   block[0]   block[1]
    0     1       0     m[0]       m[1]
    1     0
    2     1       1     m[12]      m[13]
    3     0


4. Read word at address 8. This is a miss. The cache line in set 0 is indeed valid,
   but the tags do not match. The cache loads block 4 into set 0 (replacing the
   line that was there from the read of address 0) and returns m[8] from block[0]
   of the new cache line.


    Set   Valid   Tag   block[0]   block[1]
    0     1       1     m[8]       m[9]
    1     0
    2     1       1     m[12]      m[13]
    3     0

5. Read word at address 0. This is another miss, due to the unfortunate fact
   that we just replaced block 0 during the previous reference to address 8. This
   kind of miss, where we have plenty of room in the cache but keep alternating
   references to blocks that map to the same set, is an example of a conflict miss.


    Set   Valid   Tag   block[0]   block[1]
    0     1       0     m[0]       m[1]
    1     0
    2     1       1     m[12]      m[13]
    3     0

Conflict Misses in Direct-Mapped Caches


Conflict misses are common in real programs and can cause baffling performance 
problems. Conflict misses in direct-mapped caches typically occur when programs 
access arrays whose sizes are a power of 2. For example, consider a function that 
computes the dot product of two vectors: 


float dotprod(float x[8], float y[8])
{
    float sum = 0.0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}


This function has good spatial locality with respect to x and y, and so we might ex- 
pect it to enjoy a good number of cache hits. Unfortunately, this is not always true. 

Suppose that floats are 4 bytes, that x is loaded into the 32 bytes of contiguous 
memory starting at address 0, and that y starts immediately after x at address 32. 
For simplicity, suppose that a block is 16 bytes (big enough to hold four floats) 
and that the cache consists of two sets, for a total cache size of 32 bytes. We will 
assume that the variable sum is actually stored in a CPU register and thus does not 
require a memory reference. Given these assumptions, each x[i] and y[i] will
map to the identical cache set:

Element    Address    Set index    Element    Address    Set index
x[0]       0          0            y[0]       32         0
x[1]       4          0            y[1]       36         0
x[2]       8          0            y[2]       40         0
x[3]       12         0            y[3]       44         0
x[4]       16         1            y[4]       48         1
x[5]       20         1            y[5]       52         1
x[6]       24         1            y[6]       56         1
x[7]       28         1            y[7]       60         1


At run time, the first iteration of the loop references x[0], a miss that causes
the block containing x[0]-x[3] to be loaded into set 0. The next reference is to
y[0], another miss that causes the block containing y[0]-y[3] to be copied into
set 0, overwriting the values of x that were copied in by the previous reference.
During the next iteration, the reference to x[1] misses, which causes the x[0]-
x[3] block to be loaded back into set 0, overwriting the y[0]-y[3] block. So now
we have a conflict miss, and in fact each subsequent reference to x and y will result
in a conflict miss as we thrash back and forth between blocks of x and y. The term
thrashing describes any situation where a cache is repeatedly loading and evicting
the same sets of cache blocks.

Aside Why index with the middle bits?

You may be wondering why caches use the middle bits for the set index instead of the high-order bits.
There is a good reason why the middle bits are better. Figure 6.31 shows why. If the high-order bits are
used as an index, then some contiguous memory blocks will map to the same cache set. For example, in
the figure, the first four blocks map to the first cache set, the second four blocks map to the second set,
and so on. If a program has good spatial locality and scans the elements of an array sequentially, then
the cache can only hold a block-size chunk of the array at any point in time. This is an inefficient use of
the cache. Contrast this with middle-bit indexing, where adjacent blocks always map to different cache
sets. In this case, the cache can hold an entire C-size chunk of the array, where C is the cache size.

Figure 6.31 Why caches index with the middle bits. (Panels: high-order bit indexing; middle-order bit indexing.)


The bottom line is that even though the program has good spatial locality 
and we have room in the cache to hold the blocks for both x[i] and y[i], each
reference results in a conflict miss because the blocks map to the same cache set. It 
is not unusual for this kind of thrashing to result in a slowdown by a factor of 2 or 3. 
Also, be aware that even though our example is extremely simple, the problem is 
real for larger and more realistic direct-mapped caches. 

Luckily, thrashing is easy for programmers to fix once they recognize what is
going on. One easy solution is to put B bytes of padding at the end of each array.
For example, instead of defining x to be float x[8], we define it to be float
x[12]. Assuming y starts immediately after x in memory, we have the following
mapping of array elements to sets:











Element    Address    Set index    Element    Address    Set index
x[0]       0          0            y[0]       48         1
x[1]       4          0            y[1]       52         1
x[2]       8          0            y[2]       56         1
x[3]       12         0            y[3]       60         1
x[4]       16         1            y[4]       64         0
x[5]       20         1            y[5]       68         0
x[6]       24         1            y[6]       72         0
x[7]       28         1            y[7]       76         0





With the padding at the end of x, x[i] and y[i] now map to different sets, which
eliminates the thrashing conflict misses.










Imagine a hypothetical cache that uses the high-order s bits of an address as the
set index. For such a cache, contiguous chunks of memory blocks are mapped to
the same cache set.

A. How many blocks are in each of these contiguous array chunks?
B. Consider the following code that runs on a system with a cache of the form
(S, E, B, m) = (512, 1, 32, 32):

int array[4096];

for (i = 0; i < 4096; i++)
    sum += array[i];

What is the maximum number of array blocks that are stored in the cache
at any point in time?









6.4.3 Set Associative Caches 





The problem with conflict misses in direct-mapped caches stems from the con-
straint that each set has exactly one line (or in our terminology, E = 1). A set
associative cache relaxes this constraint so that each set holds more than one cache
line. A cache with 1 < E < C/B is often called an E-way set associative cache. We
will discuss the special case, where E = C/B, in the next section. Figure 6.32 shows
the organization of a two-way set associative cache.

Figure 6.32 Set associative cache (1 < E < C/B). In a set associative cache, each
set contains more than one line. This particular example shows a two-way set
associative cache.

Figure 6.33 Set selection in a set associative cache.


Set Selection in Set Associative Caches

Set selection is identical to a direct-mapped cache, with the set index bits identi-
fying the set. Figure 6.33 summarizes this principle.

Line Matching and Word Selection in Set Associative Caches

Line matching is more involved in a set associative cache than in a direct-mapped
cache because it must check the tags and valid bits of multiple lines in order to
determine if the requested word is in the set. A conventional memory is an array of
values that takes an address as input and returns the value stored at that address.
An associative memory, on the other hand, is an array of (key, value) pairs that
takes as input the key and returns a value from one of the (key, value) pairs that
matches the input key. Thus, we can think of each set in a set associative cache as
a small associative memory where the keys are the concatenation of the tag and
valid bits, and the values are the contents of a block.

Figure 6.34 Line matching and word selection in a set associative cache. (1) The
valid bit must be set. (2) The tag bits in one of the cache lines must match the tag
bits in the address. (3) If (1) and (2), then cache hit.


Figure 6.34 shows the basic idea of line matching in an associative cache. An 
important idea here is that any line in the set can contain any of the memory blocks 
that map to that set. So the cache must search each line in the set for a valid line 
whose tag matches the tag in the address. If the cache finds such a line, then we 
have a hit and the block offset selects a word from the block, as before. 


Line Replacement on Misses in Set Associative Caches 


If the word requested by the CPU is not stored in any of the lines in the set, then
we have a cache miss, and the cache must fetch the block that contains the word
from memory. However, once the cache has retrieved the block, which line should
it replace? Of course, if there is an empty line, then it would be a good candidate.
But if there are no empty lines in the set, then we must choose one of the nonempty
lines and hope that the CPU does not reference the replaced line anytime soon.

It is very difficult for programmers to exploit knowledge of the cache replace-
ment policy in their codes, so we will not go into much detail about it here. The
simplest replacement policy is to choose the line to replace at random. Other more 
sophisticated policies draw on the principle of locality to try to minimize the prob- 
ability that the replaced line will be referenced in the near future. For example, a 
least frequently used (LFU) policy will replace the line that has been referenced 
the fewest times over some past time window. A least recently used (LRU) policy 
will replace the line that was last accessed the furthest in the past. All of these 
policies require additional time and hardware. But as we move further down the 
memory hierarchy, away from the CPU, the cost of a miss becomes more expen- 
sive and it becomes more worthwhile to minimize misses with good replacement 
policies. 


6.4.4 Fully Associative Caches 


A fully associative cache consists of a single set (i.e., E = C/B) that contains all of
the cache lines. Figure 6.35 shows the basic organization.

Figure 6.35 Fully associative cache (E = C/B). In a fully associative cache, a single
set contains all of the lines.

Figure 6.36 Set selection in a fully associative cache.

Set Selection in Fully Associative Caches


Set selection in a fully associative cache is trivial because there is only one set, 
summarized in Figure 6.36. Notice that there are no set index bits in the address, 
which is partitioned into only a tag and a block offset. 


Line Matching and Word Selection in Fully Associative Caches 


Line matching and word selection in a fully associative cache work the same as 
with a set associative cache, as we show in Figure 6.37. The difference is mainly a 
question of scale. 

Because the cache circuitry must search for many matching tags in parallel, it
is difficult and expensive to build an associative cache that is both large and fast.
As a result, fully associative caches are only appropriate for small caches, such
as the translation lookaside buffers (TLBs) in virtual memory systems that cache
page table entries (Section 9.6.2).





The problems that follow will help reinforce your understanding of how caches
work. Assume the following:

* The memory is byte addressable.

* Memory accesses are to 1-byte words (not to 4-byte words).

* Addresses are 13 bits wide.

* The cache is two-way set associative (E = 2), with a 4-byte block size (B = 4)
and eight sets (S = 8).

The contents of the cache are as follows, with all numbers given in hexadecimal
notation.


2-way set associative cache

                  Line 0                                           Line 1
Set index  Tag  Valid  Byte 0  Byte 1  Byte 2  Byte 3    Tag  Valid  Byte 0  Byte 1  Byte 2  Byte 3
0          09   1      86      30      3F      10        00   0      --      --      --      --
1          45   1      60      4F      E0      23        38   1      00      BC      0B      37
2          EB   0      --      --      --      --        0B   0      --      --      --      --
3          06   0      --      --      --      --        32   1      12      08      7B      AD
4          C7   1      06      78      07      C5        05   1      40      67      C2      3B
5          71   1      0B      DE      18      4B        6E   0      --      --      --      --
6          91   1      AD      B7      26      2D        F0   0      --      --      --      --
7          46   0      --      --      --      --        DE   1      12      C0      88      37


The following figure shows the format of an address (1 bit per box). Indicate
(by labeling the diagram) the fields that would be used to determine the following:

CO. The cache block offset
CI. The cache set index
CT. The cache tag

12  11  10  9  8  7  6  5  4  3  2  1  0

Practice Problem 6.13 (solution page 664)

Suppose a program running on the machine in Problem 6.12 references the 1-byte
word at address 0x0E34. Indicate the cache entry accessed and the cache byte
value returned in hexadecimal notation. Indicate whether a cache miss occurs. If
there is a cache miss, enter "--" for "Cache byte returned."


A. Address format (1 bit per box):

12  11  10  9  8  7  6  5  4  3  2  1  0

B. Memory reference:

Parameter                  Value
Cache block offset (CO)    0x____
Cache set index (CI)       0x____
Cache tag (CT)             0x____
Cache hit? (Y/N)           ____
Cache byte returned        0x____








Repeat Problem 6.13 for memory address 0x0DD5.

A. Address format (1 bit per box):

12  11  10  9  8  7  6  5  4  3  2  1  0

B. Memory reference:

Parameter                  Value
Cache block offset (CO)    0x____
Cache set index (CI)       0x____
Cache tag (CT)             0x____
Cache hit? (Y/N)           ____
Cache byte returned        0x____





Repeat Problem 6.13 for memory address 0x1FE4.

A. Address format (1 bit per box):

12  11  10  9  8  7  6  5  4  3  2  1  0

B. Memory reference:

Parameter                  Value
Cache block offset (CO)    0x____
Cache set index (CI)       0x____
Cache tag (CT)             0x____
Cache hit? (Y/N)           ____
Cache byte returned        0x____






Practice Problem 6.16 (solution page 665)

For the cache in Problem 6.12, list all of the hexadecimal memory addresses that
will hit in set 3.





6.4.5 Issues with Writes 


As we have seen, the operation of a cache with respect to reads is straightforward. 
First, look for a copy of the desired word w in the cache. If there is a hit, return 
w immediately. If there is a miss, fetch the block that contains w from the next 
lower level of the memory hierarchy, store the block in some cache line (possibly 
evicting a valid line), and then return w. 
The situation for writes is a little more complicated. Suppose we write a word 
w that is already cached (a write hit). After the cache updates its copy of w, what 
does it do about updating the copy of w in the next lower level of the hierarchy? 
The simplest approach, known as write-through, is to immediately write w's cache
block to the next lower level. While simple, write-through has the disadvantage
of causing bus traffic with every write. Another approach, known as write-back,
defers the update as long as possible by writing the updated block to the next lower
level only when it is evicted from the cache by the replacement algorithm. Because
of locality, write-back can significantly reduce the amount of bus traffic, but it has
the disadvantage of additional complexity. The cache must maintain an additional
dirty bit for each cache line that indicates whether or not the cache block has been
modified.

Another issue is how to deal with write misses. One approach, known as write- 
allocate, loads the corresponding block from the next lower level into the cache 
and then updates the cache block. Write-allocate tries to exploit spatial locality 
of writes, but it has the disadvantage that every miss results in a block transfer 
from the next lower level to the cache. The alternative, known as no-write-allocate, 
bypasses the cache and writes the word directly to the next lower level. Write- 
through caches are typically no-write-allocate. Write-back caches are typically 
write-allocate. 

Optimizing caches for writes is a subtle and difficult issue, and we are only
scratching the surface here. The details vary from system to system and are often
proprietary and poorly documented. To the programmer trying to write reason-
ably cache-friendly programs, we suggest adopting a mental model that assumes
write-back, write-allocate caches. There are several reasons for this suggestion. As
a rule, caches at lower levels of the memory hierarchy are more likely to use write-
back instead of write-through because of the larger transfer times. For example,
virtual memory systems (which use main memory as a cache for the blocks stored
on disk) use write-back exclusively. But as logic densities increase, the increased
complexity of write-back is becoming less of an impediment and we are seeing
write-back caches at all levels of modern systems. So this assumption matches cur-
rent trends. Another reason for assuming a write-back, write-allocate approach is
that it is symmetric to the way reads are handled, in that write-back write-allocate
tries to exploit locality. Thus, we can develop our programs at a high level to exhibit
good spatial and temporal locality rather than trying to optimize for a particular
memory system.


6.4.6 Anatomy of a Real Cache Hierarchy 


So far, we have assumed that caches hold only program data. But, in fact, caches
can hold instructions as well as data. A cache that holds instructions only is called
an i-cache. A cache that holds program data only is called a d-cache. A cache that
holds both instructions and data is known as a unified cache. Modern processors
include separate i-caches and d-caches. There are a number of reasons for this.
With two separate caches, the processor can read an instruction word and a data
word at the same time. I-caches are typically read-only, and thus simpler. The
two caches are often optimized to different access patterns and can have different
block sizes, associativities, and capacities. Also, having separate caches ensures
that data accesses do not create conflict misses with instruction accesses, and vice
versa, at the cost of a potential increase in capacity misses.

Figure 6.38 shows the cache hierarchy for the Intel Core i7 processor. Each
CPU chip has four cores. Each core has its own private L1 i-cache, L1 d-cache, and
L2 unified cache. All of the cores share an on-chip L3 unified cache. An interesting
feature of this hierarchy is that all of the SRAM cache memories are contained in
the CPU chip.

Figure 6.39 summarizes the basic characteristics of the Core i7 caches.


6.4.7 Performance Impact of Cache Parameters 
Cache performance is evaluated with a number of metrics:

Miss rate. The fraction of memory references during the execution of a pro-
    gram, or a part of a program, that miss. It is computed as # misses/
    # references.

Hit rate. The fraction of memory references that hit. It is computed as
    1 - miss rate.

Hit time. The time to deliver a word in the cache to the CPU, including the time
    for set selection, line identification, and word selection. Hit time is on the
    order of several clock cycles for L1 caches.

Miss penalty. Any additional time required because of a miss. The penalty for
    L1 misses served from L2 is on the order of 10 cycles; from L3, 50 cycles;
    and from main memory, 200 cycles.

Figure 6.38 Intel Core i7 cache hierarchy.

Cache type          Access time (cycles)    Cache size (C)    Assoc. (E)    Block size (B)    Sets (S)
L1 i-cache          4                       32 KB             8             64 B              64
L1 d-cache          4                       32 KB             8             64 B              64
L2 unified cache    10                      256 KB            8             64 B              512
L3 unified cache    40-75                   8 MB              16            64 B              8,192

Figure 6.39 Characteristics of the Intel Core i7 cache hierarchy.


Optimizing the cost and performance trade-offs of cache memories is a subtle
exercise that requires extensive simulation on realistic benchmark codes and thus
is beyond our scope. However, it is possible to identify some of the qualitative
trade-offs.


Impact of Cache Size 


On the one hand, a larger cache will tend to increase the hit rate. On the other
hand, it is always harder to make large memories run faster. As a result, larger
caches tend to increase the hit time. This explains why an L1 cache is smaller than
an L2 cache, and an L2 cache is smaller than an L3 cache.

Impact of Block Size

Large blocks are a mixed blessing. On the one hand, larger blocks can help
increase the hit rate by exploiting any spatial locality that might exist in a program.
However, for a given cache size, larger blocks imply a smaller number of cache
lines, which can hurt the hit rate in programs with more temporal locality than
spatial locality. Larger blocks also have a negative impact on the miss penalty,
since larger blocks cause larger transfer times. Modern systems such as the Core
i7 compromise with cache blocks that contain 64 bytes.


Impact of Associativity

The issue here is the impact of the choice of the parameter E, the number of
cache lines per set. The advantage of higher associativity (i.e., larger values of E)
is that it decreases the vulnerability of the cache to thrashing due to conflict misses.
However, higher associativity comes at a significant cost. Higher associativity is
expensive to implement and hard to make fast. It requires more tag bits per
line, additional LRU state bits per line, and additional control logic. Higher
associativity can increase hit time, because of the increased complexity, and it can
also increase the miss penalty because of the increased complexity of choosing a
victim line.

The choice of associativity ultimately boils down to a trade-off between the
hit time and the miss penalty. Traditionally, high-performance systems that pushed
the clock rates would opt for smaller associativity for L1 caches (where the miss
penalty is only a few cycles) and a higher degree of associativity for the lower
levels, where the miss penalty is higher. For example, in Intel Core i7 systems, the
L1 and L2 caches are 8-way associative, and the L3 cache is 16-way.

Impact of Write Strategy 


Write-through caches are simpler to implement and can use a write buffer that 
works independently of the cache to update memory. Furthermore, read misses 
are less expensive because they do not trigger a memory write. On the other 
hand, write-back caches result in fewer transfers, which allows more bandwidth 
to memory for I/O devices that perform DMA. Further, reducing the number of 
transfers becomes increasingly important as we move down the hierarchy and the 
transfer times increase. In general, caches further down the hierarchy are more 
likely to use write-back than write-through. 


6.5 Writing Cache-Friendly Code 


Aside Cache lines, sets, and blocks: What's the difference?

It is easy to confuse the distinction between cache lines, sets, and blocks. Let's review these ideas and
make sure they are clear:

* A block is a fixed-size packet of information that moves back and forth between a cache and main
memory (or a lower-level cache).

* A line is a container in a cache that stores a block, as well as other information such as the valid
bit and the tag bits.

* A set is a collection of one or more lines. Sets in direct-mapped caches consist of a single line. Sets
in set associative and fully associative caches consist of multiple lines.

In direct-mapped caches, sets and lines are indeed equivalent. However, in associative caches, sets and
lines are very different things and the terms cannot be used interchangeably.

Since a line always stores a single block, the terms "line" and "block" are often used interchange-
ably. For example, systems professionals usually refer to the "line size" of a cache, when what they
really mean is the block size. This usage is very common and shouldn't cause any confusion as long as
you understand the distinction between blocks and lines.

In Section 6.2, we introduced the idea of locality and talked in qualitative terms
about what constitutes good locality. Now that we understand how cache memo-
ries work, we can be more precise. Programs with better locality will tend to have
lower miss rates, and programs with lower miss rates will tend to run faster than
programs with higher miss rates. Thus, good programmers should always try to
write code that is cache friendly, in the sense that it has good locality. Here is the
basic approach we use to try to ensure that our code is cache friendly.






1. Make the common case go fast. Programs often spend most of their time in a
few core functions. These functions often spend most of their time in a few 
loops. So focus on the inner loops of the core functions and ignore the rest. 

2. Minimize the number of cache misses in each inner loop. All other things being 
equal, such as the total number of loads and stores, loops with better miss rates 
will run faster. 








To see how this works in practice, consider the sumvec function from Sec-
tion 6.2:

int sumvec(int v[N])
{
    int i, sum = 0;

    for (i = 0; i < N; i++)
        sum += v[i];
    return sum;
}


Is this function cache friendly? First, notice that there is good temporal locality in
the loop body with respect to the local variables i and sum. In fact, because these
are local variables, any reasonable optimizing compiler will cache them in the
register file, the highest level of the memory hierarchy. Now consider the stride-1
references to vector v. In general, if a cache has a block size of B bytes, then a
stride-k reference pattern (where k is expressed in words) results in an average of
min(1, (word size x k)/B) misses per loop iteration. This is minimized for k = 1,
so the stride-1 references to v are indeed cache friendly. For example, suppose
that v is block aligned, words are 4 bytes, cache blocks are 4 words, and the cache
is initially empty (a cold cache). Then, regardless of the cache organization, the
references to v will result in the following pattern of hits and misses:

v[i]                              i=0      i=1      i=2      i=3      i=4      i=5      i=6      i=7
Access order, [h]it or [m]iss     1 [m]    2 [h]    3 [h]    4 [h]    5 [m]    6 [h]    7 [h]    8 [h]


In this example, the reference to v[0] misses and the corresponding block,
which contains v[0]-v[3], is loaded into the cache from memory. Thus, the next
three references are all hits. The reference to v[4] causes another miss as a new
block is loaded into the cache, the next three references are hits, and so on. In
general, three out of four references will hit, which is the best we can do in this
case with a cold cache.

To summarize, our simple sumvec example illustrates two important points
about writing cache-friendly code:

* Repeated references to local variables are good because the compiler can
cache them in the register file (temporal locality).

* Stride-1 reference patterns are good because caches at all levels of the memory
hierarchy store data as contiguous blocks (spatial locality).

Spatial locality is especially important in programs that operate on multi-
dimensional arrays. For example, consider the sumarrayrows function from Sec-
tion 6.2, which sums the elements of a two-dimensional array in row-major order:


int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}


Since C stores arrays in row-major order, the inner loop of this function has 
the same desirable stride-1 access pattern as sumvec. For example, suppose we 
make the same assumptions about the cache as for sumvec. Then the references 
to the array a will result in the following pattern of hits and misses: 


a[i][j]    j=0       j=1       j=2       j=3       j=4       j=5       j=6       j=7
i=0        1 [m]     2 [h]     3 [h]     4 [h]     5 [m]     6 [h]     7 [h]     8 [h]
i=1        9 [m]     10 [h]    11 [h]    12 [h]    13 [m]    14 [h]    15 [h]    16 [h]
i=2        17 [m]    18 [h]    19 [h]    20 [h]    21 [m]    22 [h]    23 [h]    24 [h]
i=3        25 [m]    26 [h]    27 [h]    28 [h]    29 [m]    30 [h]    31 [h]    32 [h]

But consider what happens if we make the seemingly innocuous change of
permuting the loops:

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}







In this case, we are scanning the array column by column instead of row by row. 
If we are lucky and the entire array fits in the cache, then we will enjoy the same 
miss rate of 1/4. However, if the array is larger than the cache (the more likely 
case), then each and every access of a[i][j] will miss!

a[i][j]    j=0      j=1      j=2       j=3       j=4       j=5       j=6       j=7
i=0        1 [m]    5 [m]    9 [m]     13 [m]    17 [m]    21 [m]    25 [m]    29 [m]
i=1        2 [m]    6 [m]    10 [m]    14 [m]    18 [m]    22 [m]    26 [m]    30 [m]
i=2        3 [m]    7 [m]    11 [m]    15 [m]    19 [m]    23 [m]    27 [m]    31 [m]
i=3        4 [m]    8 [m]    12 [m]    16 [m]    20 [m]    24 [m]    28 [m]    32 [m]


Higher miss rates can have a significant impact on running time. For example,
on our desktop machine, sumarrayrows runs 25 times faster than sumarraycols
for large array sizes. To summarize, programmers should be aware of locality in
their programs and try to write programs that exploit it.





Transposing the rows and columns of a matrix is an important problem in signal
processing and scientific computing applications. It is also interesting from a local-
ity point of view because its reference pattern is both row-wise and column-wise.
For example, consider the following transpose routine:

typedef int array[2][2];

void transpose1(array dst, array src)
{
    int i, j;

    for (i = 0; i < 2; i++) {
        for (j = 0; j < 2; j++) {
            dst[j][i] = src[i][j];
        }
    }
}

Assume this code runs on a machine with the following properties:

* sizeof(int) = 4.

* The src array starts at address 0 and the dst array starts at address 16
(decimal).

* There is a single L1 data cache that is direct-mapped, write-through, and write-
allocate, with a block size of 8 bytes.

* The cache has a total size of 16 data bytes and the cache is initially empty.

* Accesses to the src and dst arrays are the only sources of read and write
misses, respectively.

A. For each row and col, indicate whether the access to src[row][col] and
dst[row][col] is a hit (h) or a miss (m). For example, reading src[0][0]
is a miss and writing dst[0][0] is also a miss.

dst array                        src array
         Col. 0    Col. 1                 Col. 0    Col. 1
Row 0    m         ____          Row 0    m         ____
Row 1    ____      ____          Row 1    ____      ____

B. Repeat the problem for a cache with 32 data bytes.





The heart of the recent hit game SimAquarium is a tight loop that calculates the
average position of 256 algae. You are evaluating its cache performance on a
machine with a 1,024-byte direct-mapped data cache with 16-byte blocks (B = 16).
You are given the following definitions:

struct algae_position {
    int x;
    int y;
};

struct algae_position grid[16][16];
int total_x = 0, total_y = 0;
int i, j;

You should also assume the following:

* sizeof(int) = 4.

* grid begins at memory address 0.

* The cache is initially empty.

* The only memory accesses are to the entries of the array grid. Variables i, j,
total_x, and total_y are stored in registers.

Determine the cache performance for the following code:

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_x += grid[i][j].x;
    }
}

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_y += grid[i][j].y;
    }
}

A. What is the total number of reads?

B. What is the total number of reads that miss in the cache?

C. What is the miss rate?





Practice Problem 6.19
Given the assumptions of Practice Problem 6.18, determine the cache
performance of the following code:


for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_x += grid[j][i].x;
        total_y += grid[j][i].y;
    }
}


A. What is the total number of reads?

B. What is the total number of reads that miss in the cache?
C. What is the miss rate?

D. What would the miss rate be if the cache were twice as big?


Practice Problem 6.20
Given the assumptions of Practice Problem 6.18, determine the cache
performance of the following code:


for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_x += grid[i][j].x;
        total_y += grid[i][j].y;
    }
}


A. What is the total number of reads?

B. What is the total number of reads that miss in the cache? 
C. What is the miss rate? 

D. What would the miss rate be if the cache were twice as big? 


6.6 Putting It Together: The Impact of Caches 
on Program Performance 


This section wraps up our discussion of the memory hierarchy by studying the im- 
pact that caches have on the performance of programs running on real machines. 


6.6.1 The Memory Mountain 


The rate that a program reads data from the memory system is called the read 
throughput, or sometimes the read bandwidth. If a program reads n bytes over a 
period of s seconds, then the read throughput over that period is n/s, typically 
expressed in units of megabytes per second (MB/s). 

If we were to write a program that issued a sequence of read requests from 
a tight program loop, then the measured read throughput would give us some 
insight into the performance of the memory system for that particular sequence 
of reads. Figure 6.40 shows a pair of functions that measure the read throughput 
for a particular read sequence. 

The test function generates the read sequence by scanning the first elems
elements of an array with a stride of stride. To increase the available parallelism
in the inner loop, it uses 4 x 4 unrolling (Section 5.9). The run function is a wrapper
that calls the test function and returns the measured read throughput. The call
to the test function in line 37 warms the cache. The fcyc2 function in line 38 calls
the test function with arguments elems and stride and estimates the running time
of the test function in CPU cycles. Notice that the size argument to the run function is
in units of bytes, while the corresponding elems argument to the test function is
in units of array elements. Also, notice that line 39 computes MB/s as 10^6 bytes/s,
as opposed to 2^20 bytes/s.
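
The unit conversion in line 39 can be checked by hand. The sketch below is only an illustration of that arithmetic: the helper name mbps is hypothetical, and the sample numbers in the usage note are assumed values, not measurements from any machine.

```c
/* Hypothetical helper mirroring the arithmetic in line 39 of Figure 6.40.
 * size/stride is the number of bytes actually read, and cycles/mhz is the
 * elapsed time in microseconds, so the quotient is in bytes per
 * microsecond, which is the same as MB/s with 1 MB = 10^6 bytes. */
double mbps(int size, int stride, double cycles, double mhz)
{
    double bytes_read = (double)(size / stride); /* integer division, as in run */
    double usecs = cycles / mhz;                 /* mhz = cycles per microsecond */
    return bytes_read / usecs;
}
```

For example, with an assumed 2,000 MHz clock, reading a 32,768-byte working set with stride 1 in 4,096 cycles takes 2.048 microseconds, so mbps(32768, 1, 4096.0, 2000.0) comes out to about 16,000 MB/s.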

The size and stride arguments to the run function allow us to control the 
degree of temporal and spatial locality in the resulting read sequence. Smaller 
values of size result in a smaller working set size, and thus better temporal 
locality. Smaller values of stride result in better spatial locality. If we call the run
function repeatedly with different values of size and stride, then we can recover 
a fascinating two-dimensional function of read throughput versus temporal and 
spatial locality. This function is called a memory mountain [112]. 

Every computer has a unique memory mountain that characterizes the ca- 
pabilities of its memory system. For example, Figure 6.41 shows the memory 
mountain for an Intel Core i7 Haswell system. In this example, the size varies 
from 16 KB to 128 MB, and the stride varies from 1 to 12 elements, where each 
element is an 8-byte long int. 
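
A minimal sketch of how such a sweep over size and stride might be driven is shown below. It is not the book's mountain program: the names scan, throughput, and sweep are hypothetical, the scan is not unrolled, the POSIX clock_gettime call stands in for the fcyc2 timer, and the array is capped at an assumed 4 MB to keep the example small.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

#define MAXBYTES (1 << 22)                  /* assumed 4 MB cap for this sketch */
#define MAXELEMS (MAXBYTES / sizeof(long))

static long data[MAXELEMS];                 /* array being traversed */
static volatile long sink;                  /* keeps the reads from being optimized away */

/* Simplified stand-in for test(): stride-k scan without unrolling. */
static long scan(long elems, long stride)
{
    long acc = 0;
    for (long i = 0; i < elems; i += stride)
        acc += data[i];
    return acc;
}

/* Read throughput in MB/s (10^6 bytes/s) for one (size, stride) pair. */
double throughput(long size, long stride)
{
    long elems = size / (long)sizeof(long);
    struct timespec t0, t1;
    const int reps = 100;

    sink = scan(elems, stride);             /* warm up the cache */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        sink = scan(elems, stride);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)reps * (double)((elems + stride - 1) / stride)
                 * (double)sizeof(long);
    return secs > 0.0 ? bytes / secs / 1e6 : 0.0;
}

/* Fill out[row][stride-1] with one sample per pair, halving the size on
 * each row from maxbytes down to minbytes. Returns the number of rows. */
int sweep(double out[][12], long minbytes, long maxbytes)
{
    int rows = 0;
    if (minbytes <= 0 || maxbytes > MAXBYTES)
        return 0;
    for (long size = maxbytes; size >= minbytes; size >>= 1, rows++)
        for (long stride = 1; stride <= 12; stride++)
            out[rows][stride - 1] = throughput(size, stride);
    return rows;
}
```

Plotting the resulting grid as a surface gives a rough memory mountain. The absolute numbers depend on the timer resolution and compiler settings, so they should be treated as approximate.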


code/mem/mountain/mountain.c

1   long data[MAXELEMS];      /* The global array we'll be traversing */
2
3   /* test - Iterate over first "elems" elements of array "data" with
4    *        stride of "stride", using 4 x 4 loop unrolling.
5    */
6   int test(int elems, int stride)
7   {
8       long i, sx2 = stride*2, sx3 = stride*3, sx4 = stride*4;
9       long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
10      long length = elems;
11      long limit = length - sx4;
12
13      /* Combine 4 elements at a time */
14      for (i = 0; i < limit; i += sx4) {
15          acc0 = acc0 + data[i];
16          acc1 = acc1 + data[i+stride];
17          acc2 = acc2 + data[i+sx2];
18          acc3 = acc3 + data[i+sx3];
19      }
20
21      /* Finish any remaining elements */
22      for (; i < length; i++) {
23          acc0 = acc0 + data[i];
24      }
25      return ((acc0 + acc1) + (acc2 + acc3));
26  }
27
28  /* run - Run test(elems, stride) and return read throughput (MB/s).
29   *       "size" is in bytes, "stride" is in array elements, and Mhz is
30   *       CPU clock frequency in Mhz.
31   */
32  double run(int size, int stride, double Mhz)
33  {
34      double cycles;
35      int elems = size / sizeof(double);
36
37      test(elems, stride);                     /* Warm up the cache */
38      cycles = fcyc2(test, elems, stride, 0);  /* Call test(elems,stride) */
39      return (size / stride) / (cycles / Mhz); /* Convert cycles to MB/s */
40  }

code/mem/mountain/mountain.c


Figure 6.40 Functions that measure and compute read throughput. We can generate a memory mountain
for a particular computer by calling the run function with different values of size (which corresponds to
temporal locality) and stride (which corresponds to spatial locality).














[Figure 6.41 plot: read throughput (MB/s) as a function of size (bytes) and stride
(x8 bytes), for a system with a 32 KB L1 d-cache, a 256 KB L2 cache, an 8 MB L3
cache, and a 64 B block size. The ridges of temporal locality are labeled.]

Figure 6.41 A memory mountain. Shows read throughput as a function of temporal and spatial locality.


The geography of the Core i7 mountain reveals a rich structure. Perpendicular 
to the size axis are four ridges that correspond to the regions of temporal locality 
where the working set fits entirely in the L1 cache, L2 cache, L3 cache, and 
main memory, respectively. Notice that there is more than an order of magnitude 
difference between the highest peak of the L1 ridge, where the CPU reads at a 
rate of over 14 GB/s, and the lowest point of the main memory ridge, where the 
CPU reads at a rate of 900 MB/s. 

On each of the L2, L3, and main memory ridges, there is a slope of spatial 
locality that falls downhill as the stride increases and spatial locality decreases. 
Notice that even when the working set is too large to fit in any of the caches, the 
highest point on the main memory ridge is a factor of 8 higher than its lowest point. 
So even when a program has poor temporal locality, spatial locality can still come 
to the rescue and make a significant difference. 

There is a particularly interesting flat ridge line that extends perpendicular 
to the stride axis for a stride of 1, where the read throughput is a relatively flat 
12 GB/s, even though the working set exceeds the capacities of L1 and L2. This 
is apparently due to a hardware prefetching mechanism in the Core i7 memory 
system that automatically identifies sequential stride-1 reference patterns and 
attempts to fetch those blocks into the cache before they are accessed. While the
details of the particular prefetching algorithm are not documented, it is clear from
the memory mountain that the algorithm works best for small strides, which is yet
another reason to favor sequential stride-1 accesses in your code.

[Figure 6.42 plot: read throughput (MB/s) versus working set size, with the
main memory region, L3 cache region, L2 cache region, and L1 cache region labeled.]

Figure 6.42 Ridges of temporal locality in the memory mountain. The graph shows
a slice through Figure 6.41 with stride = 8.

If we take a slice through the mountain, holding the stride constant as in
Figure 6.42, we can see the impact of cache size and temporal locality on performance.
For sizes up to 32 KB, the working set fits entirely in the L1 d-cache, and thus
reads are served from L1 at a throughput of about 12 GB/s. For sizes up to 256 KB,
the working set fits entirely in the unified L2 cache, and for sizes up to 8 MB, the
working set fits entirely in the unified L3 cache. Larger working set sizes are served
primarily from main memory.

The dips in read throughput at the leftmost edges of the L2 and L3 cache
regions, where the working set sizes of 256 KB and 8 MB are equal to their
respective cache sizes, are interesting. It is not entirely clear why these dips occur.
The only way to be sure is to perform a detailed cache simulation, but it is likely 
that the drops are caused by conflicts with other code and data lines. 

Slicing through the memory mountain in the opposite direction, holding the 
working set size constant, gives us some insight into the impact of spatial locality on 
the read throughput. For example, Figure 6.43 shows the slice for a fixed working 
set size of 4 MB. This slice cuts along the L3 ridge in Figure 6.41, where the working 
set fits entirely in the L3 cache but is too large for the L2 cache. 

Figure 6.43 A slope of spatial locality. The graph shows a slice through Figure 6.41
with size = 4 MB.

Notice how the read throughput decreases steadily as the stride increases from
one to eight words. In this region of the mountain, a read miss in L2 causes a
block to be transferred from L3 to L2. This is followed by some number of hits
on the block in L2, depending on the stride. As the stride increases, the ratio of
L2 misses to L2 hits increases. Since misses are served more slowly than hits, the
read throughput decreases. Once the stride reaches eight 8-byte words, which on
this system equals the block size of 64 bytes, every read request misses in L2 and
must be served from L3. Thus, the read throughput for strides of at least eight is
a constant rate determined by the rate that cache blocks can be transferred from
L3 into L2.
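
The relationship between stride and L2 miss ratio described above can be written down as a small model. This is a sketch under simplifying assumptions that are not taken from the measurements: each block touched costs exactly one miss, there is no prefetching, and the function name l2_miss_rate is hypothetical.

```c
/* Average fraction of reads that miss in L2 for a stride-k scan,
 * assuming one miss per block touched and no prefetching. */
double l2_miss_rate(int stride, int word_bytes, int block_bytes)
{
    int step = stride * word_bytes;     /* bytes between consecutive reads */
    if (step >= block_bytes)
        return 1.0;                     /* every read lands in a new block */
    return (double)step / (double)block_bytes;
}
```

With 8-byte words and 64-byte blocks, this model gives 0.125 for stride 1, 0.5 for stride 4, and 1.0 for any stride of eight or more, which matches the flattening of the slope at stride 8.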

To summarize our discussion of the memory mountain, the performance of the 
memory system is not characterized by a single number. Instead, it is a mountain 
of temporal and spatial locality whose elevations can vary by over an order of
magnitude. Wise programmers try to structure their programs so that they run in
the peaks instead of the valleys. The aim is to exploit temporal locality so that 
heavily used words are fetched from the L1 cache, and to exploit spatial locality 
so that as many words as possible are accessed from a single L1 cache line. 





Practice Problem 6.21
Use the memory mountain in Figure 6.41 to estimate the time, in CPU cycles, to
read an 8-byte word from the L1 d-cache.






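6.6.2 Rearranging Loops to Increase Spatial Locality


Consider the problem of multiplying a pair of n x n matrices: C = AB. For
example, if n = 2, then

\[
\begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
=
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
\]

where

\[
\begin{aligned}
c_{11} &= a_{11}b_{11} + a_{12}b_{21} \\
c_{12} &= a_{11}b_{12} + a_{12}b_{22} \\
c_{21} &= a_{21}b_{11} + a_{22}b_{21} \\
c_{22} &= a_{21}b_{12} + a_{22}b_{22}
\end{aligned}
\]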

A matrix multiply function is usually implemented using three nested loops, which 
are identified by their indices i, j, and k. If we permute the loops and make some 
other minor code changes, we can create the six functionally equivalent versions 
of matrix multiply shown in Figure 6.44. Each version is uniquely identified by the 
ordering of its loops. 

At a high level, the six versions are quite similar. If addition is associative,¹
then each version computes an identical result. Each version performs O(n^3) total
operations and an identical number of adds and multiplies. Each of the n^2 elements
of A and B is read n times. Each of the n^2 elements of C is computed by summing
n values. However, if we analyze the behavior of the innermost loop iterations, we 
find that there are differences in the number of accesses and the locality. For the 
purposes of this analysis, we make the following assumptions: 


* Each array is an n x n array of double, with sizeof(double) = 8.
* There is a single cache with a 32-byte block size (B = 32).
* The array size n is so large that a single matrix row does not fit in the L1 cache.
* The compiler stores local variables in registers, and thus references to local
variables inside loops do not require any load or store instructions.


Figure 6.45 summarizes the results of our inner-loop analysis. Notice that the 
six versions pair up into three equivalence classes, which we denote by the pair of 
matrices that are accessed in the inner loop. For example, versions ijk and jik are 
members of class AB because they reference arrays A and B (but not C) in their 
innermost loop. For each class, we have counted the number of loads (reads) and 
stores (writes) in each inner-loop iteration, the number of references to A, B, and 
C that will miss in the cache in each loop iteration, and the total number of cache
misses per iteration. 

The inner loops of the class AB routines (Figure 6.44(a) and (b)) scan a row 
of array A with a stride of 1. Since each cache block holds four 8-byte words, the 
miss rate for A is 0.25 misses per iteration. On the other hand, the inner loop scans 
a column of B with a stride of n. Since n is large, each access of array B results in 
a miss, for a total of 1.25 misses per iteration. 

(a) Version ijk
code/mem/matmult/mm.c

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += A[i][k]*B[k][j];
        C[i][j] += sum;
    }

(b) Version jik
code/mem/matmult/mm.c

for (j = 0; j < n; j++)
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += A[i][k]*B[k][j];
        C[i][j] += sum;
    }

(c) Version jki
code/mem/matmult/mm.c

for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
        r = B[k][j];
        for (i = 0; i < n; i++)
            C[i][j] += A[i][k]*r;
    }

(d) Version kji
code/mem/matmult/mm.c

for (k = 0; k < n; k++)
    for (j = 0; j < n; j++) {
        r = B[k][j];
        for (i = 0; i < n; i++)
            C[i][j] += A[i][k]*r;
    }

(e) Version kij
code/mem/matmult/mm.c

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = A[i][k];
        for (j = 0; j < n; j++)
            C[i][j] += r*B[k][j];
    }

(f) Version ikj
code/mem/matmult/mm.c

for (i = 0; i < n; i++)
    for (k = 0; k < n; k++) {
        r = A[i][k];
        for (j = 0; j < n; j++)
            C[i][j] += r*B[k][j];
    }

Figure 6.44 Six versions of matrix multiply. Each version is uniquely identified by the ordering of its loops.


Matrix multiply                        Per iteration
version (class)    Loads   Stores   A misses   B misses   C misses   Total misses
ijk & jik (AB)       2       0        0.25       1.00       0.00        1.25
jki & kji (AC)       2       1        1.00       0.00       1.00        2.00
kij & ikj (BC)       2       1        0.00       0.25       0.25        0.50

Figure 6.45 Analysis of matrix multiply inner loops. The six versions partition into
three equivalence classes, denoted by the pair of arrays that are accessed in the inner
loop.


1. As we learned in Chapter 2, floating-point addition is commutative, but in general not associative.
In practice, if the matrices do not mix extremely large values with extremely small ones, as often is
true when the matrices store physical properties, then the assumption of associativity is reasonable.


The inner loops in the class AC routines (Figure 6.44(c) and (d)) have some
problems. Each iteration performs two loads and a store (as opposed to the
class AB routines, which perform two loads and no stores). Second, the inner
loop scans the columns of A and C with a stride of n. The result is a miss on each
load, for a total of two misses per iteration. Notice that interchanging the loops
has decreased the amount of spatial locality compared to the class AB routines.

Figure 6.46 Core i7 matrix multiply performance.

The BC routines (Figure 6.44(e) and (f)) present an interesting trade-off: With 
two loads and a store, they require one more memory operation than the AB 
routines. On the other hand, since the inner loop scans both B and C row-wise 
with a stride-1 access pattern, the miss rate on each array is only 0.25 misses per 
iteration, for a total of 0.50 misses per iteration. 
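
The per-iteration miss counts for the three classes follow mechanically from the stated assumptions, and the arithmetic can be sketched in a few lines. The enum and function names below are hypothetical and exist only for this illustration; the model assumes a stride-1 scan misses once per cache block, a stride-n scan with large n misses on every access, and a value held in a register across the inner loop never misses.

```c
/* How one array is referenced inside the innermost loop. */
enum access { FIXED, ROW_WISE, COLUMN_WISE };

/* Misses per inner-loop iteration for a single array reference. */
double misses_per_iter(enum access a, int block_bytes, int elem_bytes)
{
    switch (a) {
    case ROW_WISE:    return (double)elem_bytes / (double)block_bytes;
    case COLUMN_WISE: return 1.0;
    default:          return 0.0;   /* FIXED: kept in a register */
    }
}

/* Total misses per iteration for arrays A, B, and C, using the
 * 32-byte blocks and 8-byte doubles assumed in the text. */
double class_total(enum access a, enum access b, enum access c)
{
    const int B = 32, W = 8;
    return misses_per_iter(a, B, W)
         + misses_per_iter(b, B, W)
         + misses_per_iter(c, B, W);
}
```

Under this model, class_total(ROW_WISE, COLUMN_WISE, FIXED) gives 1.25 for class AB, class_total(COLUMN_WISE, FIXED, COLUMN_WISE) gives 2.00 for class AC, and class_total(FIXED, ROW_WISE, ROW_WISE) gives 0.50 for class BC.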

Figure 6.46 summarizes the performance of different versions of matrix mul- 
tiply on a Core i7 system. The graph plots the measured number of CPU cycles 
per inner-loop iteration as a function of array size (n). 

There are a number of interesting points to notice about this graph: 


* For large values of n, the fastest version runs almost 40 times faster than the
slowest version, even though each performs the same number of floating-point
arithmetic operations.

* Pairs of versions with the same number of memory references and misses per
iteration have almost identical measured performance.

* The two versions with the worst memory behavior, in terms of the number of
accesses and misses per iteration, run significantly slower than the other four
versions, which have fewer misses or fewer accesses, or both.

* Miss rate, in this case, is a better predictor of performance than the total
number of memory accesses. For example, the class BC routines, with 0.5
misses per iteration, perform much better than the class AB routines, with
1.25 misses per iteration, even though the class BC routines perform more
memory references in the inner loop (two loads and one store) than the class
AB routines (two loads).

* For large values of n, the performance of the fastest pair of versions (kij and
ikj) is constant. Even though the array is much larger than any of the SRAM
cache memories, the prefetching hardware is smart enough to recognize the
stride-1 access pattern, and fast enough to keep up with memory accesses
in the tight inner loop. This is a stunning accomplishment by the Intel engi-
neers who designed this memory system, providing even more incentive for
programmers to develop programs with good spatial locality.


Web Aside MEM:BLOCKING Using blocking to increase temporal locality

There is an interesting technique called blocking that can improve the temporal
locality of inner loops. The general idea of blocking is to organize the data
structures in a program into large chunks called blocks. (In this context, "block"
refers to an application-level chunk of data, not to a cache block.) The program
is structured so that it loads a chunk into the L1 cache, does all the reads and
writes that it needs to on that chunk, then discards the chunk, loads in the next
chunk, and so on.

Unlike the simple loop transformations for improving spatial locality, blocking
makes the code harder to read and understand. For this reason, it is best suited
for optimizing compilers or frequently executed library routines. Blocking does
not improve the performance of matrix multiply on the Core i7, because of its
sophisticated prefetching hardware. Still, the technique is interesting to study
and understand because it is a general concept that can produce big performance
gains on systems that don't prefetch.
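
To make the blocking idea from the Web Aside above more concrete, here is a minimal sketch of a blocked matrix multiply. It is one possible arrangement, not the version from the Web Aside itself: the function name matmul_blocked, the flat row-major layout, and the bsize parameter are assumptions made for this illustration, and bsize must be greater than zero.

```c
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices stored in flat arrays.
 * The k and j index ranges are split into tiles of at most bsize
 * elements, so each bsize x bsize tile of B is reused across all
 * rows i before the loops move on to the next tile. */
void matmul_blocked(const double *A, const double *B, double *C,
                    size_t n, size_t bsize)
{
    for (size_t kk = 0; kk < n; kk += bsize) {
        size_t kend = min_sz(kk + bsize, n);
        for (size_t jj = 0; jj < n; jj += bsize) {
            size_t jend = min_sz(jj + bsize, n);
            for (size_t i = 0; i < n; i++) {
                for (size_t k = kk; k < kend; k++) {
                    double r = A[i * n + k];
                    /* Stride-1 scans of B and C, as in the class BC routines. */
                    for (size_t j = jj; j < jend; j++)
                        C[i * n + j] += r * B[k * n + j];
                }
            }
        }
    }
}
```

A reasonable starting point is to choose bsize so that a bsize x bsize tile of B fits comfortably in the L1 cache. Whether this helps on a given machine is an empirical question, as the aside notes for the Core i7.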


6.6.3 Exploiting Locality in Your Programs 


As we have seen, the memory system is organized as a hierarchy of storage 
devices, with smaller, faster devices toward the top and larger, slower devices 
toward the bottom. Because of this hierarchy, the effective rate that a program 
can access memory locations is not characterized by a single number. Rather, it is 
a wildly varying function of program locality (what we have dubbed the memory 
mountain) that can vary by orders of magnitude. Programs with good locality 
access most of their data from fast cache memories. Programs with poor locality 
access most of their data from the relatively slow DRAM main memory. 

Programmers who understand the nature of the memory hierarchy can ex- 
ploit this understanding to write more efficient programs, regardless of the specific 
memory system organization. In particular, we recommend the following tech- 
niques: 


* Focus your attention on the inner loops, where the bulk of the computations
and memory accesses occur.

* Try to maximize the spatial locality in your programs by reading data objects
sequentially, with stride 1, in the order they are stored in memory.

* Try to maximize the temporal locality in your programs by using a data object
as often as possible once it has been read from memory.
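
As a small illustration of the stride-1 recommendation, the two functions below compute the same sum over a two-dimensional array but visit its elements in different orders. The array dimensions and function names are assumptions chosen for this sketch.

```c
#define M 256
#define N 256

/* Stride-1 scan: visits a[i][j] in the order C stores it (row-major),
 * so consecutive reads fall in the same cache block. */
long sum_rows(int a[M][N])
{
    long sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-N scan: same result, but consecutive reads are N ints apart,
 * so each read may touch a different cache block. */
long sum_cols(int a[M][N])
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same value. Only the order of the memory accesses differs, which is exactly the kind of change that moves a program between the peaks and valleys of the memory mountain.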
6.7 Summary


The basic storage technologies are random access memories (RAMs), nonvolatile
memories (ROMs), and disks. RAM comes in two basic forms. Static RAM 
(SRAM) is faster and more expensive and is used for cache memories. Dynamic 
RAM (DRAM) is slower and less expensive and is used for the main memory and 
graphics frame buffers. ROMs retain their information even if the supply voltage 
is turned off. They are used to store firmware. Rotating disks are mechanical non- 
volatile storage devices that hold enormous amounts of data at a low cost per bit, 
but with much longer access times than DRAM. Solid state disks (SSDs) based 
on nonvolatile flash memory are becoming increasingly attractive alternatives to 
rotating disks for some applications. 

In general, faster storage technologies are more expensive per bit and have 
smaller capacities. The price and performance properties of these technologies 
are changing at dramatically different rates. In particular, DRAM and disk access 
times are much larger than CPU cycle times. Systems bridge these gaps by organizing
memory as a hierarchy of storage devices, with smaller, faster devices at
the top and larger, slower devices at the bottom. Because well-written programs 
have good locality, most data are served from the higher levels, and the effect is 
a memory system that runs at the rate of the higher levels, but at the cost and 
capacity of the lower levels. 

Programmers can dramatically improve the running times of their programs 
by writing programs with good spatial and temporal locality. Exploiting SRAM- 
based cache memories is especially important. Programs that fetch data primarily 
from cache memories can run much faster than programs that fetch data primarily 
from memory. 


Bibliographic Notes 


Memory and disk technologies change rapidly. In our experience, the best sources 
of technical information are the Web pages maintained by the manufacturers. 
Companies such as Micron, Toshiba, and Samsung provide a wealth of current 
technical information on memory devices. The pages for Seagate and Western 
Digital provide similarly useful information about disks. 

Textbooks on circuit and logic design provide detailed information about 
memory technology [58, 89]. IEEE Spectrum published a series of survey arti- 
cles on DRAM [55]. The International Symposiums on Computer Architecture 
(ISCA) and High Performance Computer Architecture (HPCA) are common fo- 
rums for characterizations of DRAM memory performance [28, 29, 18]. 

Wilkes wrote the first paper on cache memories [117]. Smith wrote a clas- 
sic survey [104]. Przybylski wrote an authoritative book on cache design [86]. 
Hennessy and Patterson provide a comprehensive discussion of cache design is- 
sues [46]. Levinthal wrote a comprehensive performance guide for the Intel Core 
i7 [70]. 

Stricker introduced the idea of the memory mountain as a comprehensive 
characterization of the memory system in [112] and suggested the term "memory 
mountain" informally in later presentations of the work. Compiler researchers
work to increase locality by automatically performing the kinds of manual code
transformations we discussed in Section 6.6 [22, 32, 66, 72, 79, 87, 119]. Carter
and colleagues have proposed a cache-aware memory controller [17]. Other re- 
searchers have developed cache-oblivious algorithms that are designed to run well 
without any explicit knowledge of the structure of the underlying cache mem- 
ory [30, 38, 39, 9]. 

There is a large body of literature on building and using disk storage. Many
storage researchers look for ways to aggregate individual disks into larger, more
robust, and more secure storage pools [20, 40, 41, 83, 121]. Others look for ways
to use caches and locality to improve the performance of disk accesses [12, 21].
Systems such as Exokernel provide increased user-level control of disk and memory
resources [57]. Systems such as the Andrew File System [78] and Coda [94]
extend the memory hierarchy across computer networks and mobile notebook
computers. Schindler and Ganger developed an interesting tool that automatically
characterizes the geometry and performance of SCSI disk drives [95]. Researchers
have investigated techniques for building and using flash-based SSDs [8, 81].


Homework Problems 


6.22 ◆◆

Suppose you are asked to design a rotating disk where the number of bits per
track is constant. You know that the number of bits per track is determined
by the circumference of the innermost track, which you can assume is also the
circumference of the hole. Thus, if you make the hole in the center of the disk
larger, the number of bits per track increases, but the total number of tracks
decreases. If you let r denote the radius of the platter, and x · r the radius of the
hole, what value of x maximizes the capacity of the disk?


6.23 ◆

Estimate the average time (in ms) to access a sector on the following disk:

Parameter                          Value
Rotational rate                    15,000 RPM
T_avg seek                         4 ms
Average number of sectors/track    800

6.24 ◆◆

Suppose that a 2 MB file consisting of 512-byte logical blocks is stored on a disk
drive with the following characteristics:

Parameter                          Value
Rotational rate                    15,000 RPM
T_avg seek                         4 ms
Average number of sectors/track    1,000
Surfaces                           8
Sector size                        512 bytes

For each case below, suppose that a program reads the logical blocks of the
file sequentially, one after the other, and that the time to position the head over
the first block is T_avg seek + T_avg rotation.


A. Best case: Estimate the optimal time (in ms) required to read the file over
all possible mappings of logical blocks to disk sectors.


B. Random case: Estimate the time (in ms) required to read the file if blocks
are mapped randomly to disk sectors.



6.25 ◆

The following table gives the parameters for a number of different caches. For 
each cache, fill in the missing fields in the table. Recall that m is the number of 
physical address bits, C is the cache size (number of data bytes), B is the block 
size in bytes, E is the associativity, S is the number of cache sets, t is the number of 
tag bits, s is the number of set index bits, and b is the number of block offset bits. 


Cache    m     C        B    E    S    t    s    b

1.       32    1,024
2.       32    1,024
3.       32    1,024
4.       32    1,024
5.       32    1,024
6.       32    1,024


6.26 ◆

The following table gives the parameters for a number of different caches. Your 
task is to fill in the missing fields in the table. Recall that m is the number of physical 
address bits, C is the cache size (number of data bytes), B is the block size in bytes, 
E is the associativity, S is the number of cache sets, t is the number of tag bits, s is 
the number of set index bits, and b is the number of block offset bits. 


Cache m 


1. 32 

2. 2,048 
3. 1,024 
4. 1,024 


6.27 ◆
This problem concerns the cache in Practice Problem 6.12. 


A. List all of the hex memory addresses that will hit in set 1. 
B. List all of the hex memory addresses that will hit in set 6. 


6.28 ◆◆
This problem concerns the cache in Practice Problem 6.12.


A. List all of the hex memory addresses that will hit in set 2.
B. List all of the hex memory addresses that will hit in set 4.
C. List all of the hex memory addresses that will hit in set 5.
D. List all of the hex memory addresses that will hit in set 7.


6.29 ◆◆
Suppose we have a system with the following properties:


* The memory is byte addressable.
* Memory accesses are to 1-byte words (not to 4-byte words).
* Addresses are 12 bits wide.
* The cache is two-way set associative (E = 2), with a 4-byte block size (B = 4)
and four sets (S = 4).


The contents of the cache are as follows, with all addresses, tags, and values given
in hexadecimal notation:


Set index   Tag   Valid   Byte 0   Byte 1   Byte 2   Byte 3

0           00    1       40       41       42       43
            83    1       FE       97       CC       D0
1           00    1       44       45       46       47
            83    0       -        -        -        -
2           00    1       48       49       4A       4B
            40    0       -        -        -        -
3           FF    1       9A       C0       03       FF
            60    0       -        -        -        -


A. The following diagram shows the format of an address (1 bit per box).
Indicate (by labeling the diagram) the fields that would be used to determine
the following:


CO. The cache block offset
CI. The cache set index
CT. The cache tag


B. For each of the following memory accesses, indicate if it will be a cache hit
or miss when carried out in sequence as listed. Also give the value of a read
if it can be inferred from the information in the cache.





Operation    Address    Hit?    Read value (or unknown)
Read         0x834      ____    ____
Write        0x836      ____
Read         0xFFD      ____    ____


6.30 ◆
Suppose we have a system with the following properties:


* The memory is byte addressable.
* Memory accesses are to 1-byte words (not to 4-byte words).
* Addresses are 13 bits wide.
* The cache is 4-way set associative (E = 4), with a 4-byte block size (B = 4)
and eight sets (S = 8).


Consider the following cache state. All addresses, tags, and values are given in 
hexadecimal format. The Index column contains the set index for each set of four 
lines. The Tag columns contain the tag value for each line. The V columns contain 
the valid bit for each line. The Bytes 0-3 columns contain the data for each line, 
numbered left to right starting with byte 0 on the left. 


4-way set associative cache

Index  Tag  V  Bytes 0-3     Tag  V  Bytes 0-3     Tag  V  Bytes 0-3     Tag  V  Bytes 0-3
0      F0      ED 32 0A A2   8A   1  BF 80 1D FC   14      EF 09 86 2A           25 44 6F 1A
1      BC      03 3E CD 38           16 7B ED 5A   BC      8E 4C DF 18           FB B7 12 02
2      BC      54 9E 1E FA           DC 81 B2 14   00      B6 1F 7B 44           10 F5 B8 2E
3      BE      2F 7E 3D A8           27 95 A4 74   C4      07 11 6B D8           C7 B7 AF C2
4      7E      32 21 1C 2C           22 C2 DC 34   BC      BA DD 37 D8           E7 A2 39 BA
5      98      A9 76 2B EE           BC 91 D5 92   98      80 BA 9B F6           48 16 81 0A
6      38      5D 4D F7 DA           69 C2 8C 74   8A      A8 CE 7F DA           FA 93 EB 48
7      8A      04 2A 32 6A           B1 86 56 0E   CC      96 30 47 F2           F8 1D 42 30


A. What is the size (C) of this cache in bytes?


B. The box that follows shows the format of an address (1 bit per box). Indicate
(by labeling the diagram) the fields that would be used to determine the
following:


CO. The cache block offset
CI. The cache set index
CT. The cache tag

COOLER 
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Suppose that a program using tHe cache in Problem 6.30 references thé 1-byte 
word at address 0x071A. Indicate the cache entry accessed and the cache byte 
value returned in hex. Indicate whether a cache miss occurs. If there is a cache 
miss, enter *—" for “Cache byte returned.” Hint: Pay attention to those valid | 


bits! 





A. Address format (1 bit per box): 


B. Memory reference: 














Parameter              Value
Block offset (CO)      0x ____
Index (CI)             0x ____
Cache tag (CT)         0x ____
Cache hit? (Y/N)       ____
Cache byte returned    0x ____

6.32 ◆◆

Repeat Problem 6.31 for memory address 0x16E8.

A. Address format (1 bit per box):

12 11 10 9 8 7 6 5 4 3 2 1 0

B. Memory reference:

Parameter              Value
Cache offset (CO)      0x ____
Cache index (CI)       0x ____
Cache tag (CT)         0x ____
Cache hit? (Y/N)       ____
Cache byte returned    0x ____

For the cache in Problem 6.30, list the eight memory addresses (in hex) that will
hit in set 2.


6.34 ◆◆
Consider the following matrix transpose routine:

typedef int array[4][4];

void transpose2(array dst, array src)
{
    int i, j;

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            dst[j][i] = src[i][j];
        }
    }
}

Assume this code runs on a machine with the following properties:

* sizeof(int) = 4.
* The src array starts at address 0 and the dst array starts at address 64
(decimal).
* There is a single L1 data cache that is direct-mapped, write-through,
write-allocate, with a block size of 16 bytes.
* The cache has a total size of 32 data bytes, and the cache is initially empty.
* Accesses to the src and dst arrays are the only sources of read and write
misses, respectively.

A. For each row and col, indicate whether the access to src[row][col] and
dst[row][col] is a hit (h) or a miss (m). For example, reading src[0][0]
is a miss and writing dst[0][0] is also a miss.

                dst array                                src array
        Col. 0  Col. 1  Col. 2  Col. 3          Col. 0  Col. 1  Col. 2  Col. 3
Row 0   m       ____    ____    ____    Row 0   m       ____    ____    ____
Row 1   ____    ____    ____    ____    Row 1   ____    ____    ____    ____
Row 2   ____    ____    ____    ____    Row 2   ____    ____    ____    ____
Row 3   ____    ____    ____    ____    Row 3   ____    ____    ____    ____

6.35 ◆◆
Repeat Problem 6.34 for a cache with a total size of 128 data bytes.

                dst array                                src array
        Col. 0  Col. 1  Col. 2  Col. 3          Col. 0  Col. 1  Col. 2  Col. 3
Row 0   ____    ____    ____    ____    Row 0   ____    ____    ____    ____
Row 1   ____    ____    ____    ____    Row 1   ____    ____    ____    ____
Row 2   ____    ____    ____    ____    Row 2   ____    ____    ____    ____
Row 3   ____    ____    ____    ____    Row 3   ____    ____    ____    ____

This problem tests your ability to predict the cache behavior of C code. You are
given the following code to analyze:

int sum = 0;
for (i = 0; i < 128; i++) {
    sum += x[0][i] * x[1][i];
}


Assume we execute this under the following conditions:

* sizeof(int) = 4.
* Array x begins at memory address 0x0 and is stored in row-major order.
* In each case below, the cache is initially empty.
* The only memory accesses are to the entries of the array x. All other variables
are stored in registers.

Given these assumptions, estimate the miss rates for the following cases:

A. Case 1: Assume the cache is 512 bytes, direct-mapped, with 16-byte cache
blocks. What is the miss rate?

B. Case 2: What is the miss rate if we double the cache size to 1,024 bytes?

C. Case 3: Now assume the cache is 512 bytes, two-way set associative using
an LRU replacement policy, with 16-byte cache blocks. What is the cache
miss rate?

D. For case 3, will a larger cache size help to reduce the miss rate? Why or
why not?

E. For case 3, will a larger block size help to reduce the miss rate? Why or why
not?


6.37 ◆◆

This is another problem that tests your ability to analyze the cache behavior of C
code. Assume we execute the three summation functions in Figure 6.47 under the
following conditions:

* sizeof(int) = 4.
* The machine has a 4 KB direct-mapped cache with a 16-byte block size.
* Within the two loops, the code uses memory accesses only for the array data.
The loop indices and the value sum are held in registers.
* Array a is stored starting at memory address 0x08000000.

Fill in the table for the approximate cache miss rate for the two cases N = 64
and N = 60.

Function   N = 64   N = 60
sumA       ____     ____
sumB       ____     ____
sumC       ____     ____

typedef int array_t[N][N];

int sumA(array_t a)
{
    int i, j;
    int sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            sum += a[i][j];
        }
    return sum;
}

int sumB(array_t a)
{
    int i, j;
    int sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++) {
            sum += a[i][j];
        }
    return sum;
}

int sumC(array_t a)
{
    int i, j;
    int sum = 0;
    for (j = 0; j < N; j += 2)
        for (i = 0; i < N; i += 2) {
            sum += (a[i][j] + a[i+1][j]
                    + a[i][j+1] + a[i+1][j+1]);
        }
    return sum;
}


Figure 6.47 Functions referenced in Problem 6.37. 


6.38 ◆

3M decides to make Post-its by printing yellow squares on white pieces of paper.
As part of the printing process, they need to set the CMYK (cyan, magenta, yellow,
black) value for every point in the square. 3M hires you to determine the efficiency
of the following algorithms on a machine with a 2,048-byte direct-mapped data
cache with 32-byte blocks. You are given the following definitions:

struct point_color {
    int c;
    int m;
    int y;
    int k;
};

struct point_color square[16][16];
int i, j;

Assume the following:

* sizeof(int) = 4.
* square begins at memory address 0.
* The cache is initially empty.
* The only memory accesses are to the entries of the array square. Variables i
and j are stored in registers.

Determine the cache performance of the following code:





for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        square[i][j].c = 0;
        square[i][j].m = 0;
        square[i][j].y = 1;
        square[i][j].k = 0;
    }
}

A. What is the total number of writes?
B. What is the total number of writes that miss in the cache?
C. What is the miss rate?






6.39 ◆

Given the assumptions in Problem 6.38, determine the cache performance of the
following code:

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        square[j][i].c = 0;
        square[j][i].m = 0;
        square[j][i].y = 1;
        square[j][i].k = 0;
    }
}

A. What is the total number of writes?
B. What is the total number of writes that miss in the cache?
C. What is the miss rate?


6.40 ◆
Given the assumptions in Problem 6.38, determine the cache performance of the
following code:

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        square[i][j].y = 1;
    }
}
for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        square[i][j].c = 0;
        square[i][j].m = 0;
        square[i][j].k = 0;
    }
}

A. What is the total number of writes?
B. What is the total number of writes that miss in the cache?
C. What is the miss rate?

You are writing a new 3D game that you hope will earn you fame and fortune. You
are currently working on a function to blank the screen buffer before drawing the 
next frame. The screen you are working with is a 640 x 480 array of pixels. The 
machine you are working on has a 64 KB direct-mapped cache with 4-byte lines. 
The C structures you are using are as follows: 


struct pixel {
    char r;
    char g;
    char b;
    char a;
};

struct pixel buffer[480][640];
int i, j;
char *cptr;
int *iptr;

Assume the following:

* sizeof(char) = 1 and sizeof(int) = 4.
* buffer begins at memory address 0.
* The cache is initially empty.
* The only memory accesses are to the entries of the array buffer. Variables i,
j, cptr, and iptr are stored in registers.


What percentage of writes in the following code will miss in the cache?

for (j = 0; j < 640; j++) {
    for (i = 0; i < 480; i++) {
        buffer[i][j].r = 0;
        buffer[i][j].g = 0;
        buffer[i][j].b = 0;
        buffer[i][j].a = 0;
    }
}


6.42 ◆
Given the assumptions in Problem 6.41, what percentage of writes in the following
code will miss in the cache?

char *cptr = (char *) buffer;
for (; cptr < (((char *) buffer) + 640 * 480 * 4); cptr++)
    *cptr = 0;

Given the assumptions in Problem 6.41, what percentage of writes in the following
code will miss in the cache?

int *iptr = (int *) buffer;
for (; iptr < ((int *) buffer + 640 * 480); iptr++)
    *iptr = 0;

6.44 ◆◆◆

Download the mountain program from the CS:APP Web site and run it on your
favorite PC/Linux system. Use the results to estimate the sizes of the caches on
your system.


6.45 ◆◆◆
In this assignment, you will apply the concepts you learned in Chapters 5 and 6
to the problem of optimizing code for a memory-intensive application. Consider
a procedure to copy and transpose the elements of an N x N matrix of type int.
That is, for source matrix S and destination matrix D, we want to copy each
element s_{i,j} to d_{j,i}. This code can be written with a simple loop,

void transpose(int *dst, int *src, int dim)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[j*dim + i] = src[i*dim + j];
}


where the arguments to the procedure are pointers to the destination (dst) and 
source (src) matrices, as well as the matrix size N (dim). Your job is to devise a 
transpose routine that runs as fast as possible. 


6.46 ◆◆◆

This assignment is an intriguing variation of Problem 6.45. Consider the problem
of converting a directed graph g into its undirected counterpart g'. The graph
g' has an edge from vertex u to vertex v if and only if there is an edge from u
to v or from v to u in the original graph g. The graph g is represented by its
adjacency matrix G as follows. If N is the number of vertices in g, then G is an
N x N matrix and its entries are all either 0 or 1. Suppose the vertices of g are
named v_0, v_1, ..., v_{N-1}. Then G[i][j] is 1 if there is an edge from v_i to v_j and
is 0 otherwise. Observe that the elements on the diagonal of an adjacency matrix
are always 1 and that the adjacency matrix of an undirected graph is symmetric.
This code can be written with a simple loop:

void col_convert(int *G, int dim) {
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            G[j*dim + i] = G[j*dim + i] || G[i*dim + j];
}


Your job is to devise a conversion routine that runs as fast as possible. As 
before, you will need to apply concepts you learned in Chapters 5 and 6 to come 
up with a good solution. 


Solutions to Practice Problems 


Solution to Problem 6.1 (page 584) 

The idea here is to minimize the number of address bits by minimizing the aspect 
ratio max(r, c)/ min(r, c). In other words, the squarer the array, the fewer the 
address bits. 


Organization   max(b_r, b_c)
16 x 1         2
16 x 4         2
128 x 8        4
512 x 4        5
1,024 x 4      5
Solution to Problem 6.2 (page 592)
The point of this little drill is to make sure you understand the relationship between
cylinders and tracks. Once you have that straight, just plug and chug:

\[
\begin{aligned}
\text{Disk capacity} &= \frac{512\ \text{bytes}}{\text{sector}} \times \frac{400\ \text{sectors}}{\text{track}} \times \frac{10{,}000\ \text{tracks}}{\text{surface}} \times \frac{2\ \text{surfaces}}{\text{platter}} \times \frac{2\ \text{platters}}{\text{disk}} \\
&= 8{,}192{,}000{,}000\ \text{bytes} \\
&= 8.192\ \text{GB}
\end{aligned}
\]


Solution to Problem 6.3 (page 595) 


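The solution to this problem is a straightforward application of the formula for
disk access time. The average rotational latency (in ms) is

\[
\begin{aligned}
T_{\text{avg rotation}} &= 1/2 \times T_{\text{max rotation}} \\
&= 1/2 \times (60\ \text{secs}/15{,}000\ \text{RPM}) \times 1{,}000\ \text{ms/sec} \\
&\approx 2\ \text{ms}
\end{aligned}
\]

The average transfer time is

\[
\begin{aligned}
T_{\text{avg transfer}} &= (60\ \text{secs}/15{,}000\ \text{RPM}) \times 1/500\ \text{sectors/track} \times 1{,}000\ \text{ms/sec} \\
&= 0.008\ \text{ms}
\end{aligned}
\]

Putting it all together, the total estimated access time is

\[
\begin{aligned}
T_{\text{access}} &= T_{\text{avg seek}} + T_{\text{avg rotation}} + T_{\text{avg transfer}} \\
&= 8\ \text{ms} + 2\ \text{ms} + 0.008\ \text{ms} \\
&\approx 10\ \text{ms}
\end{aligned}
\]
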
Solution to Problem 6.4 (page 595)

This is a good check of your understanding of the factors that affect disk
performance. First we need to determine a few basic properties of the file and the disk.
The file consists of 2,000 512-byte logical blocks. For the disk, $T_{\text{avg seek}} = 5$ ms,
$T_{\text{max rotation}} = 6$ ms, and $T_{\text{avg rotation}} = 3$ ms.

A. Best case: In the optimal case, the blocks are mapped to contiguous sectors,
on the same cylinder, that can be read one after the other without moving
the head. Once the head is positioned over the first sector it takes two full
rotations (1,000 sectors per rotation) of the disk to read all 2,000 blocks.
So the total time to read the file is
$T_{\text{avg seek}} + T_{\text{avg rotation}} + 2 \times T_{\text{max rotation}} = 5 + 3 + 12 = 20$ ms.

B. Random case: In this case, where blocks are mapped randomly to sectors,
reading each of the 2,000 blocks requires $T_{\text{avg seek}} + T_{\text{avg rotation}}$ ms, so the
total time to read the file is $(T_{\text{avg seek}} + T_{\text{avg rotation}}) \times 2{,}000 = 16{,}000$ ms
(16 seconds!).

You can see now why it’s often a good idea to defragment your disk drive!

Solution to Problem 6.5 (page 601)

This is a simple problem that will give you some interesting insights into the
feasibility of SSDs. Recall that for disks, 1 PB = $10^9$ MB. Then the following
straightforward translation of units yields the following predicted times for each case:

A. Worst-case sequential writes (470 MB/s):
$(10^9 \times 128) \times (1/470) \times (1/(86{,}400 \times 365)) \approx 8$ years
B. Worst-case random writes (303 MB/s):
$(10^9 \times 128) \times (1/303) \times (1/(86{,}400 \times 365)) \approx 13$ years
C. Average case (20 GB/day):
$(10^9 \times 128) \times (1/20{,}000) \times (1/365) \approx 17{,}500$ years

So even if the SSD operates continuously, it should last for at least 8 years, which
is longer than the expected lifetime of most computers.


Solution to Problem 6.6 (page 604) 

In the 10-year period between 2005 and 2015, the unit price of rotating disks 
dropped by a factor of 166, which means the price is dropping by roughly a factor 
of 2 every 18 months or so. Assuming this trend continues, a petabyte of storage, 
which costs about $30,000 in 2015, will drop below $500 after about seven of these
factor-of-2 reductions. Since these are occurring every 18 months, we might expect
a petabyte of storage to be available for $500 around the year 2025. 


Solution to Problem 6.7 (page 608)
To create a stride-1 reference pattern, the loops must be permuted so that the
rightmost indices change most rapidly.

int sumarray3d(int a[N][N][N])
{
    int i, j, k, sum = 0;

    for (k = 0; k < N; k++) {
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                sum += a[k][i][j];
            }
        }
    }
    return sum;
}

This is an important idea. Make sure you understand why this particular loop
permutation results in a stride-1 access pattern.

Solution to Problem 6.8 (page 609)

The key to solving this problem is to visualize how the array is laid out in memory
and then analyze the reference patterns. Function clear1 accesses the array using
a stride-1 reference pattern and thus clearly has the best spatial locality. Function
clear2 scans each of the N structs in order, which is good, but within each struct it
hops around in a non-stride-1 pattern at the following offsets from the beginning
of the struct: 0, 12, 4, 16, 8, 20. So clear2 has worse spatial locality than clear1.
Function clear3 not only hops around within each struct, but also hops from struct
to struct. So clear3 exhibits worse spatial locality than clear2 and clear1.


Solution to Problem 6.9 (page 616) 
The solution is a straightforward application of the definitions of the various cache 
parameters in Figure 6.26. Not very exciting, but you need to understand how 
the cache organization induces these partitions in the address bits before you can 
really understand how caches work. 





Cache   m    C       B    E    S     t    s   b
1.      32   1,024   4    1    256   22   8   2
2.      32   1,024   8    4    32    24   5   3
3.      32   1,024   32   32   1     27   0   5


Solution to Problem 6.10 (page 624) 
The padding eliminates the conflict misses. Thus, three-fourths of the references 
are hits. 


Solution to Problem 6.11 (page 624) 

Sometimes, understanding why something is a bad idea helps you understand why 
the alternative is a good idea. Here, the bad idea we are looking at is indexing the 
cache with the high-order bits instead of the middle bits.

A. With high-order bit indexing, each contiguous array chunk consists of 2^t
blocks, where t is the number of tag bits. Thus, the first 2^t contiguous blocks
of the array would map to set 0, the next 2^t blocks would map to set 1, and
so on.

B. For a direct-mapped cache where (S, E, B, m) = (512, 1, 32, 32), the cache
capacity is 512 32-byte blocks with t = 18 tag bits in each cache line. Thus, the
first 2^18 blocks in the array would map to set 0, the next 2^18 blocks to set 1.
Since our array consists of only (4,096 x 4)/32 = 512 blocks, all of the blocks
in the array map to set 0. Thus, the cache will hold at most 1 array block at
any point in time, even though the array is small enough to fit entirely in the
cache. Clearly, using high-order bit indexing makes poor use of the cache.


Solution to Problem 6.12 (page 628)
The 2 low-order bits are the block offset (CO), followed by 3 bits of set index (CI),
with the remaining bits serving as the tag (CT):

CT  CT  CT  CT  CT  CT  CT  CT  CI  CI  CI  CO  CO
12  11  10  9   8   7   6   5   4   3   2   1   0


Solution to Problem 6.13 (page 628)
Address: 0x0E34

A. Address format (1 bit per box):

CT  CT  CT  CT  CT  CT  CT  CT  CI  CI  CI  CO  CO
0   1   1   1   0   0   0   1   1   0   1   0   0
12  11  10  9   8   7   6   5   4   3   2   1   0

B. Memory reference:

Parameter                  Value
Cache block offset (CO)    0x0
Cache set index (CI)       0x5
Cache tag (CT)             0x71
Cache hit? (Y/N)
Cache byte returned

Solution to Problem 6.14 (page 629)
Address: 0x0DD5

A. Address format (1 bit per box):

CT  CT  CT  CT  CT  CT  CT  CT  CI  CI  CI  CO  CO
0   1   1   0   1   1   1   0   1   0   1   0   1
12  11  10  9   8   7   6   5   4   3   2   1   0

B. Memory reference:

Parameter                  Value
Cache block offset (CO)    0x1
Cache set index (CI)       0x5
Cache tag (CT)             0x6E
Cache hit? (Y/N)           N
Cache byte returned        —

Solution to Problem 6.15 (page 629)
Address: 0x1FE4

A. Address format (1 bit per box):

CT  CT  CT  CT  CT  CT  CT  CT  CI  CI  CI  CO  CO
1   1   1   1   1   1   1   1   0   0   1   0   0
12  11  10  9   8   7   6   5   4   3   2   1   0

B. Memory reference:

Parameter                  Value
Cache block offset (CO)    0x0
Cache set index (CI)       0x1
Cache tag (CT)             0xFF
Cache hit? (Y/N)           N
Cache byte returned        —


Solution to Problem 6.16 (page 630)

This problem is a sort of inverse version of Practice Problems 6.12-6.15 that
requires you to work backward from the contents of the cache to derive the
addresses that will hit in a particular set. In this case, set 3 contains one valid
line with a tag of 0x32. Since there is only one valid line in the set, four addresses
will hit. These addresses have the binary form 0 0110 0100 11xx. Thus, the four
hex addresses that hit in set 3 are

0x064C, 0x064D, 0x064E, and 0x064F


Solution to Problem 6.17 (page 636)

A. The key to solving this problem is to visualize the picture in Figure 6.48.
Notice that each cache line holds exactly one row of the array, that the cache
is exactly large enough to hold one array, and that for all i, row i of src and
dst maps to the same cache line. Because the cache is too small to hold both
arrays, references to one array keep evicting useful lines from the other array.
For example, the write to dst[0][0] evicts the line that was loaded when
we read src[0][0]. So when we next read src[0][1], we have a miss.

            dst array                    src array
        Col. 0   Col. 1              Col. 0   Col. 1
Row 0   m        m           Row 0   m        m
Row 1   m        m           Row 1   m        h

B. When the cache is 32 bytes, it is large enough to hold both arrays. Thus, the
only misses are the initial cold misses.

            dst array                    src array
        Col. 0   Col. 1              Col. 0   Col. 1
Row 0   m        h           Row 0   m        h
Row 1   m        h           Row 1   m        h

Figure 6.48 Figure for solution to Problem 6.17.
Solution to Problem 6.18 (page 637)
Each 16-byte cache line holds two contiguous algae_position structures. Each
loop visits these structures in memory order, reading one integer element each
time. So the pattern for each loop is miss, hit, miss, hit, and so on. Notice that for
this problem we could have predicted the miss rate without actually enumerating
the total number of reads and misses.

A. What is the total number of read accesses? 512 reads.
B. What is the total number of read accesses that miss in the cache? 256 misses.
C. What is the miss rate? 256/512 = 50%.


Solution to Problem 6.19 (page 638) 

The key to this problem is noticing that the cache can only hold 1/2 of the ar- 
ray. So the column-wise scan of the second half of the array evicts the lines that 
were loaded during the scan of the first half. For example, reading the first ele- 
ment of grid [8] [0] evicts the line that was loaded when we read elements from 
grid [0] [0]. This line also contained grid [0] [1]. So when we begin scanning the 
next column, the reference to the first element of grid [0] [1] misses. 


A. What is the total number of read accesses? 512 reads.
B. What is the total number of read accesses that miss in the cache? 256 misses.
C. What is the miss rate? 256/512 = 50%.
D. What would the miss rate be if the cache were twice as big? If the cache were
twice as big, it could hold the entire grid array. The only misses would be
the initial cold misses, and the miss rate would be 1/4 = 25%.


Solution to Problem 6.20 (page 638) 
This loop has a nice stride-1 reference pattern, and thus the only misses are the 
initial cold misses. 


A. What is the total number of read accesses? 512 reads.
B. What is the total number of read accesses that miss in the cache? 128 misses.
C. What is the miss rate? 128/512 = 25%.
D. What would the miss rate be if the cache were twice as big? Increasing the
cache size by any amount would not change the miss rate, since cold misses
are unavoidable.


Solution to Problem 6.21 (page 643)
The sustained throughput using large strides from L1 is about 12,000 MB/s, the
clock frequency is 2,100 MHz, and the individual read accesses are in units
of 8-byte longs. Thus, from this graph we can estimate that it takes roughly
2,100/12,000 x 8 = 1.4 ≈ 1.5 cycles to access a word from L1 on this machine,
which is roughly 2.5 times faster than the nominal 4-cycle latency from L1. This is
due to the parallelism of the 4 x 4 unrolled loop, which allows multiple loads to
be in flight at the same time.





Our exploration of computer systems continues with a closer look
at the systems software that builds and runs application programs.
The linker combines different parts of our programs into a single
file that can be loaded into memory and executed by the processor.
Modern operating systems cooperate with the hardware to provide each
program with the illusion that it has exclusive use of a processor and the
main memory, when in reality multiple programs are running on the
system at any point in time.

In the first part of this book, you developed a good understanding of
the interaction between your programs and the hardware. Part II of the
book will broaden your view of systems by giving you a solid understanding
of the interactions between your programs and the operating system.
You will learn how to use services provided by the operating system to
build system-level programs such as Unix shells and dynamic memory

Linking is the process of collecting and combining various pieces of code and

data into a single file that can be loaded (copied) into memory and executed. 
Linking can be performed at compile time, when the source code is translated 
into machine code; at load time, when the program is loaded into memory and 
executed by the loader; and even at run time, by application programs. On early 
computer systems, linking was performed manually. On modern systems, linking 
is performed automatically by programs called linkers. 

Linkers play a crucial role in software development because they enable 
separate compilation. Instead of organizing a large application as one monolithic 
source file, we can decompose it into smaller, more manageable modules that can 
be modified and compiled separately. When we change one of these modules, we 
simply recompile it and relink the application, without having to recompile the 
other files. 

Linking is usually handled quietly by the linker and is not an important 
issue for students who are building small programs in introductory programming 
classes. So why bother learning about linking? 


* Understanding linkers will help you build large programs. Programmers who
build large programs often encounter linker errors caused by missing modules,
missing libraries, or incompatible library versions. Unless you understand how
a linker resolves references, what a library is, and how a linker uses a library 
to resolve references, these kinds of errors will be baffling and frustrating. 


* Understanding linkers will help you avoid dangerous programming errors. The
decisions that Linux linkers make when they resolve symbol references can 
silently affect the correctness of your programs. Programs that incorrectly 
define multiple global variables can pass through the linker without any warn- 
ings in the default case. The resulting programs can exhibit baffling run-time 
behavior and are extremely difficult to debug. We will show you how this hap- 
pens and how to avoid it. 


* Understanding linking will help you understand how language scoping rules
are implemented. For example, what is the difference between global and local 
variables? What does it really mean when you define a variable or function 
with the static attribute? 


* Understanding linking will help you understand other important systems con-
cepts. The executable object files produced by linkers play key roles in impor- 
tant systems functions such as loading and running programs, virtual memory, 
paging, and memory mapping. 

* Understanding linking will enable you to exploit shared libraries. For many
years, linking was considered to be fairly straightforward and uninteresting. 
However, with the increased importance of shared libraries and dynamic 
linking in modern operating systems, linking is a sophisticated process that 
provides the knowledgeable programmer with significant power. For exam- 
ple, many software products use shared libraries to upgrade shrink-wrapped 
binaries at run time. Also, many Web servers rely on dynamic linking of shared 
libraries to serve dynamic content. 
(a) main.c
code/link/main.c

int sum(int *a, int n);

int array[2] = {1, 2};

int main()
{
    int val = sum(array, 2);
    return val;
}

(b) sum.c
code/link/sum.c

int sum(int *a, int n)
{
    int i, s = 0;

    for (i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

Figure 7.1 Example program 1. The example program consists of two source files, main.c and sum.c. The
main function initializes an array of ints, and then calls the sum function to sum the array elements.

This chapter provides a thorough discussion of all aspects of linking, from
traditional static linking, to dynamic linking of shared libraries at load time,
to dynamic linking of shared libraries at run time. We will describe the basic
mechanisms using real examples, and we will identify situations in which linking
issues can affect the performance and correctness of your programs. To keep things
concrete and understandable, we will couch our discussion in the context of an
x86-64 system running Linux and using the standard ELF-64 (hereafter referred to as
ELF) object file format. However, it is important to realize that the basic concepts
of linking are universal, regardless of the operating system, the ISA, or the object
file format. Details may vary, but the concepts are the same.


7.1 Compiler Drivers 


Consider the C program in Figure 7.1. It will serve as a simple running example
throughout this chapter that will allow us to make some important points about
how linkers work.

Most compilation systems provide a compiler driver that invokes the language
preprocessor, compiler, assembler, and linker, as needed on behalf of the user. For
example, to build the example program using the GNU compilation system, we
might invoke the gcc driver by typing the following command to the shell:

linux> gcc -Og -o prog main.c sum.c

Figure 7.2 Static linking. The linker combines relocatable object files to form an
executable object file prog. [Diagram: the source files main.c and sum.c each pass
through the translators (cpp, cc1, as) to produce the relocatable object files main.o
and sum.o, which the linker combines into the fully linked executable object file prog.]

Figure 7.2 summarizes the activities of the driver as it translates the example
program from an ASCII source file into an executable object file. (If you want
to see these steps for yourself, run gcc with the -v option.) The driver first runs
the C preprocessor (cpp),1 which translates the C source file main.c into an ASCII
intermediate file main.i:

cpp [other arguments] main.c /tmp/main.i

1. In some versions of gcc, the preprocessor is integrated into the compiler driver.


Next, the driver runs the C compiler (cc1), which translates main.i into an ASCII
assembly-language file main.s:

cc1 /tmp/main.i -Og [other arguments] -o /tmp/main.s

Then, the driver runs the assembler (as), which translates main.s into a binary
relocatable object file main.o:

as [other arguments] -o /tmp/main.o /tmp/main.s

The driver goes through the same process to generate sum.o. Finally, it runs the
linker program ld, which combines main.o and sum.o, along with the necessary
system object files, to create the binary executable object file prog:

ld -o prog [system object files and args] /tmp/main.o /tmp/sum.o


To run the executable prog, we type its name on the Linux shell’s command
line:

linux> ./prog


The shell invokes a function in the operating system called the loader, which copies 
the code and data in the executable file prog into memory, and then transfers 
control to the beginning of the program. 


7.2 Static Linking 


Static linkers such as the Linux ld program take as input a collection of relocatable
object files and command-line arguments and generate as output a fully linked
executable object file that can be loaded and run. The input relocatable object
files consist of various code and data sections, where each section is a contiguous
sequence of bytes. Instructions are in one section, initialized global variables are
in another section, and uninitialized variables are in yet another section.
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To build the executable, the linker must perform two main tasks: 


Step 1. Symbol resolution. Object files define and reference symbols, where each 
symbol corresponds to a function, a global variable, or a static variable 
(i.e., any C variable declared with the static attribute). The purpose of 
symbol resolution is to associate each symbol reference with exactly one 
symbol definition. 


Step 2. Relocation. Compilers and assemblers generate code and data sections 
that start at address 0. The linker relocates these sections by associating a 
memory location with each symbol definition, and then modifying all of 
the references to those symbols so that they point to this memory location. 
The linker blindly performs these relocations using detailed instructions, 
generated by the assembler, called relocation entries. 
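
To make the relocation step concrete, here is a deliberately simplified sketch in C. It is not the ELF mechanism itself: the names ToyReloc and apply_relocs are hypothetical, only 32-bit absolute references are modeled, and real relocation entries carry more information (such as a relocation type). The point is only that the linker patches bytes at recorded offsets without understanding the surrounding code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified relocation entry: "at this byte offset in the
 * section, store the run-time address of symbol number sym". */
typedef struct {
    size_t offset;   /* where in the section the reference lives */
    int    sym;      /* index into the table of symbol addresses */
} ToyReloc;

/* Patch every recorded reference once a run-time address has been chosen
 * for each symbol. The section is treated as an opaque block of bytes. */
void apply_relocs(uint8_t *section, const ToyReloc *relocs, size_t nrelocs,
                  const uint32_t *sym_addr)
{
    for (size_t i = 0; i < nrelocs; i++) {
        uint32_t addr = sym_addr[relocs[i].sym];
        /* memcpy sidesteps alignment restrictions on the destination */
        memcpy(section + relocs[i].offset, &addr, sizeof addr);
    }
}
```

A caller would fill in sym_addr after deciding where each section lands in memory, then call apply_relocs once per section.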


The sections that follow describe these tasks in more detail. As you read, keep 
in mind some basic facts about linkers: Object files are merely collections of blocks 
of bytes. Some of these blocks contain program code, others contain program 
data, and others contain data structures that guide the linker and loader. A linker 
concatenates blocks together, decides on run-time locations for the concatenated 
blocks, and modifies various locations within the code and data blocks. Linkers 
have minimal understanding of the target machine. The compilers and assemblers 
that generate the object files have already done most of the work. 


7.3 Object Files 


Object files come in three forms: 


Relocatable object file. Contains binary code and data in a form that can be 
combined with other relocatable object files at compile time to create an 
executable object file. 


Executable object file. Contains binary code and data in a form that can be 
copied directly into memory and executed. 


Shared object file. A special type of relocatable object file that can be loaded 
into memory and linked dynamically, at either load time or run time. 


Compilers and assemblers generate relocatable object files (including shared
object files). Linkers generate executable object files. Technically, an object module 
is a sequence of bytes, and an object file is an object module stored on disk in a 
file. However, we will use these terms interchangeably. 

Object files are organized according to specific object file formats, which vary
from system to system. The first Unix systems from Bell Labs used the a.out
format. (To this day, executables are still referred to as a.out files.) Windows
uses the Portable Executable (PE) format. Mac OS X uses the Mach-O format.
Modern x86-64 Linux and Unix systems use Executable and Linkable Format 
(ELF). Although our discussion will focus on ELF, the basic concepts are similar, 
regardless of the particular format.


Figure 7.3 Typical ELF relocatable object file. [Diagram: the ELF header at the start of the file, followed by the sections, followed by the section header table, which describes the object file sections.]


7.4 Relocatable Object Files 


Figure 7.3 shows the format of a typical ELF relocatable object file. The ELF 
header begins with a 16-byte sequence that describes the word size and byte 
ordering of the system that generated the file. The rest of the ELF header contains 
information that allows a linker to parse and interpret the object file. This includes 
the size of the ELF header, the object file type (e.g., relocatable, executable, or 
shared), the machine type (e.g., x86-64), the file offset of the section header table, 
and the size and number of entries in the section header table. The locations 
and sizes of the various sections are described by the section header table, which 
contains a fixed-size entry for each section in the object file. 

Sandwiched between the ELF header and the section header table are the
sections themselves. A typical ELF relocatable object file contains the following
sections:


.text The machine code of the compiled program.


.rodata Read-only data such as the format strings in printf statements, and
jump tables for switch statements.


.data Initialized global and static C variables. Local C variables are maintained
at run time on the stack and do not appear in either the .data or .bss
sections.


.bss Uninitialized global and static C variables, along with any global or static
variables that are initialized to zero. This section occupies no actual space
in the object file; it is merely a placeholder. Object file formats distinguish
between initialized and uninitialized variables for space efficiency: uninitialized
variables do not have to occupy any actual disk space in the object
file. At run time, these variables are allocated in memory with an initial
value of zero.


Aside Why is uninitialized data called .bss?

The use of the term .bss to denote uninitialized data is universal. It was originally an acronym for the
"block started by symbol" directive from the IBM 704 assembly language (circa 1957), and the acronym
has stuck. A simple way to remember the difference between the .data and .bss sections is to think


.symtab A symbol table with information about functions and global variables
that are defined and referenced in the program. Some programmers mistakenly
believe that a program must be compiled with the -g option to
get symbol table information. In fact, every relocatable object file has
a symbol table in .symtab (unless the programmer has specifically removed
it with the strip command). However, unlike the symbol table
inside a compiler, the .symtab symbol table does not contain entries for
local variables.





.rel.text A list of locations in the .text section that will need to be modified
when the linker combines this object file with others. In general, any
instruction that calls an external function or references a global variable
will need to be modified. On the other hand, instructions that call local
functions do not need to be modified. Note that relocation information
is not needed in executable object files, and is usually omitted unless the
user explicitly instructs the linker to include it.


.rel.data Relocation information for any global variables that are referenced
or defined by the module. In general, any initialized global variable whose
initial value is the address of a global variable or externally defined function
will need to be modified.


.debug A debugging symbol table with entries for local variables and typedefs
defined in the program, global variables defined and referenced in the
program, and the original C source file. It is only present if the compiler
driver is invoked with the -g option.


.line A mapping between line numbers in the original C source program and
machine code instructions in the .text section. It is only present if the
compiler driver is invoked with the -g option.


.strtab A string table for the symbol tables in the .symtab and .debug sections
and for the section names in the section headers. A string table is a
sequence of null-terminated character strings.
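
As a rough illustration of the section list above, the following small C module annotates where each object would typically be placed. The variable and function names are made up for this sketch, and the exact placement can vary with the compiler, its version, and its flags, so treat the comments as the usual case rather than a guarantee.

```c
int counter = 42;             /* initialized global: .data                 */
int zeroed = 0;               /* global initialized to zero: .bss          */
static int hidden;            /* uninitialized static: .bss                */
int *counter_ptr = &counter;  /* initial value is an address: .data, and
                                 the slot needs an entry in .rel.data      */
const char *name = "prog";    /* the pointer lives in .data; the string
                                 literal it points to is read-only data    */

/* The machine code for next_id goes in .text. */
int next_id(void)
{
    static int id;            /* static local, uninitialized: .bss         */
    int step = 1;             /* ordinary local: stack or register only    */

    id += step;
    hidden = id;
    return hidden;
}
```

Compiling such a file with gcc -c and listing its symbols with readelf -s is one way to check these placements on a particular system.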


7.5 Symbols and Symbol Tables 


Each relocatable object module, m, has a symbol table that contains information 
about the symbols that are defined and referenced by m. In the context of a linker, 
there are three different kinds of symbols:

* Global symbols that are defined by module m and that can be referenced by
other modules. Global linker symbols correspond to nonstatic C functions and
global variables.

* Global symbols that are referenced by module m but defined by some other
module. Such symbols are called externals and correspond to nonstatic C
functions and global variables that are defined in other modules.

* Local symbols that are defined and referenced exclusively by module m. These
correspond to static C functions and global variables that are defined with the
static attribute. These symbols are visible anywhere within module m, but
cannot be referenced by other modules.


It is important to realize that local linker symbols are not the same as local
program variables. The symbol table in .symtab does not contain any symbols
that correspond to local nonstatic program variables. These are managed at run
time on the stack and are not of interest to the linker.

Interestingly, local procedure variables that are defined with the C static
attribute are not managed on the stack. Instead, the compiler allocates space in
.data or .bss for each definition and creates a local linker symbol in the symbol
table with a unique name. For example, suppose a pair of functions in the same
module define a static local variable x:


int f()
{
    static int x = 0;
    return x;
}

int g()
{
    static int x = 1;
    return x;
}


In this case, the compiler exports a pair of local linker symbols with different names
to the assembler. For example, it might use x.1 for the definition in function f and
x.2 for the definition in function g.
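
The following variation on the example (modified here so that each function increments its variable; the initial value 100 is arbitrary) shows the practical consequence: the two variables named x are separate objects, and each keeps its value across calls because it lives in .data or .bss rather than on the stack.

```c
int f(void)
{
    static int x = 0;    /* one copy, private to f, kept across calls */
    return ++x;
}

int g(void)
{
    static int x = 100;  /* a different object that merely shares the name */
    return ++x;
}
```

Calling f twice and then g once would yield 1, 2, and 101: the calls to f never disturb the x that belongs to g.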

Symbol tables are built by assemblers, using symbols exported by the compiler
into the assembly-language .s file. An ELF symbol table is contained in the
.symtab section. It contains an array of entries. Figure 7.4 shows the format of
each entry.

The name is a byte offset into the string table that points to the null-terminated
string name of the symbol. The value is the symbol's address. For relocatable
modules, the value is an offset from the beginning of the section where the object
is defined. For executable object files, the value is an absolute run-time address.
The size is the size (in bytes) of the object. The type is usually either data or
function. The symbol table can also contain entries for the individual sections
and for the path name of the original source file. So there are distinct types for
these objects as well. The binding field indicates whether the symbol is local or
global.


code/link/elfstructs.c

typedef struct {
    int   name;       /* String table offset */
    char  type:4,     /* Function or data (4 bits) */
          binding:4;  /* Local or global (4 bits) */
    char  reserved;   /* Unused */
    short section;    /* Section header index */
    long  value;      /* Section offset or absolute address */
    long  size;       /* Object size in bytes */
} Elf64_Symbol;

code/link/elfstructs.c
Figure 7.4 ELF symbol table entry. The type and binding fields are 4 bits each.


New to C? Hiding variable and function names with static

C programmers use the static attribute to hide variable and function declarations inside modules,
much as you would use public and private declarations in Java and C++. In C, source files play the
role of modules. Any global variable or function declared with the static attribute is private to that
module. Similarly, any global variable or function declared without the static attribute is public and
can be accessed by any other module. It is good programming practice to protect your variables and
functions with the static attribute wherever possible.


Each symbol is assigned to some section of the object file, denoted by the sec- 
tion field, which is an index into the section header table. There are three special 
pseudosections that don't have entries in the section header table: ABS is for sym- 
bols that should not be relocated. UNDEF is for undefined symbols—that is, sym- 
bols that are referenced in this object module but defined elsewhere. COMMON 
is for uninitialized data objects that are not yet allocated. For COMMON symbols, 
the value field gives the alignment requirement, and size gives the minimum size. 
Note that these pseudosections exist only in relocatable object files; they do not 
exist in executable object files. 
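
A short sketch may help show how the type, binding, and section fields are read in practice. The numeric values below follow the ELF specification, where binding and type share a single byte (binding in the upper 4 bits, type in the lower 4) and the pseudosections are reserved section index values. The TOY_ names and the helper functions are our own and are not taken from any system header.

```c
#include <stdint.h>

/* Numeric values as given in the ELF specification; the TOY_ prefix marks
 * the names as local to this sketch. */
enum { TOY_STB_LOCAL = 0, TOY_STB_GLOBAL = 1 };
enum { TOY_STT_NOTYPE = 0, TOY_STT_OBJECT = 1, TOY_STT_FUNC = 2,
       TOY_STT_SECTION = 3, TOY_STT_FILE = 4 };
enum { TOY_SHN_UNDEF = 0, TOY_SHN_ABS = 0xfff1, TOY_SHN_COMMON = 0xfff2 };

/* Binding occupies the upper 4 bits of the info byte, type the lower 4. */
static inline unsigned sym_binding(uint8_t info) { return info >> 4; }
static inline unsigned sym_type(uint8_t info)    { return info & 0xf; }

static inline uint8_t sym_info(unsigned binding, unsigned type)
{
    return (uint8_t)((binding << 4) | (type & 0xf));
}

/* Classify a section index as a pseudosection or an ordinary index into
 * the section header table. */
const char *section_kind(uint16_t shndx)
{
    switch (shndx) {
    case TOY_SHN_UNDEF:  return "UNDEF";
    case TOY_SHN_ABS:    return "ABS";
    case TOY_SHN_COMMON: return "COMMON";
    default:             return "regular section";
    }
}
```

For a global function, sym_info(TOY_STB_GLOBAL, TOY_STT_FUNC) packs to 0x12, and the two accessors recover the binding and type from that byte.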

The distinction between COMMON and .bss is subtle. Modern versions of
GCC assign symbols in relocatable object files to COMMON and .bss using the
following convention:


COMMON Uninitialized global variables 


.bss Uninitialized static variables, and global or static variables that are 
initialized to zero


The reason for this seemingly arbitrary distinction stems from the way the linker
performs symbol resolution, which we will explain in Section 7.6.

The GNU READELF program is a handy tool for viewing the contents of object
files. For example, here are the last three symbol table entries for the relocatable
object file main.o, from the example program in Figure 7.1. The first eight entries,
which are not shown, are local symbols that the linker uses internally.


Num:    Value          Size Type    Bind   Vis      Ndx Name
  8: 0000000000000000    24 FUNC    GLOBAL DEFAULT    1 main
  9: 0000000000000000     8 OBJECT  GLOBAL DEFAULT    3 array
 10: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND sum


In this example, we see an entry for the definition of global symbol main, a 24-byte
function located at an offset (i.e., value) of zero in the .text section. This
is followed by the definition of the global symbol array, an 8-byte object located
at an offset of zero in the .data section. The last entry comes from the reference
to the external symbol sum. READELF identifies each section by an integer index.
Ndx=1 denotes the .text section, and Ndx=3 denotes the .data section.


This problem concerns the m.o and swap.o modules from Figure 7.5. For each
symbol that is defined or referenced in swap.o, indicate whether or not it will
have a symbol table entry in the .symtab section in module swap.o. If so, indicate
the module that defines the symbol (swap.o or m.o), the symbol type (local, global,
or extern), and the section (.text, .data, .bss, or COMMON) it is assigned to
in the module.


(a) m.c
code/link/m.c

void swap();

int buf[2] = {1, 2};

int main()
{
    swap();
    return 0;
}

code/link/m.c

(b) swap.c
code/link/swap.c

extern int buf[];

int *bufp0 = &buf[0];
int *bufp1;

void swap()
{
    int temp;

    bufp1 = &buf[1];
    temp = *bufp0;
    *bufp0 = *bufp1;
    *bufp1 = temp;
}

code/link/swap.c

Figure 7.5 Example program for Practice Problem 7.1.


Symbol   .symtab entry?   Symbol type   Module where defined   Section
buf
bufp0
bufp1
swap
temp


7.6 Symbol Resolution 


The linker resolves symbol references by associating each reference with exactly 
one symbol definition from the symbol tables of its input relocatable object files. 
Symbol resolution is straightforward for references to local symbols that are defined
in the same module as the reference. The compiler allows only one definition
of each local symbol per module. The compiler also ensures that static local variables,
which get local linker symbols, have unique names.

Resolving references to global symbols, however, is trickier. When the com- 
piler encounters a symbol (either a variable or function name) that is not defined 
in the current module, it assumes that it is defined in some other module, gener- 
ates a linker symbol table entry, and leaves it for the linker to handle. If the linker 
is unable to find a definition for the referenced symbol in any of its input modules, 
it prints an (often cryptic) error message and terminates. For example, if we try to 
compile and link the following source file on a Linux machine, 


void foo(void);

int main() {
    foo();
    return 0;
}

then the compiler runs without a hitch, but the linker terminates when it cannot 
resolve the reference to foo: 


linux> gcc -Wall -Og -o linkerror linkerror.c
/tmp/ccSz5uti.o: In function 'main':
/tmp/ccSz5uti.o(.text+0x7): undefined reference to 'foo'


Symbol resolution for global symbols is also tricky because multiple object 
modules might define global symbols with the same name. In this case, the linker 
must either flag an error or somehow choose one of the definitions and discard 
the rest. The approach adopted by Linux systems involves cooperation between 
the compiler, assembler, and linker and can introduce some baffling bugs to the 
unwary programmer.


7.6.1 How Linkers Resolve Duplicate Symbol Names


The input to the linker is a collection of relocatable object modules. Each of these 
modules defines a set of symbols, some of which are local (visible only to the 
module that defines it), and some of which are global (visible to other modules). 
What happens if multiple modules define global symbols with the same name? 
Here is the approach that Linux compilation systems use. 

At compile time, the compiler exports each global symbol to the assembler
as either strong or weak, and the assembler encodes this information implicitly
in the symbol table of the relocatable object file. Functions and initialized global
variables get strong symbols. Uninitialized global variables get weak symbols.

Given this notion of strong and weak symbols, Linux linkers use the following 
rules for dealing with duplicate symbol names: 


Rule 1. Multiple strong symbols with the same name are not allowed. 


Rule 2. Given a strong symbol and multiple weak symbols with the same name,
choose the strong symbol. 


Rule 3. Given multiple weak symbols with the same name, choose any of the 
weak symbols. 
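
The three rules can be sketched as a small C function that is given the strength of every definition of one symbol name and decides which definition wins. This is only a model of the rules as stated above, not linker source code; the names Strength and resolve are hypothetical, and where rule 3 permits any weak definition the sketch arbitrarily takes the first.

```c
#include <stddef.h>

typedef enum { WEAK, STRONG } Strength;

/* defs[i] is the strength of the definition contributed by module i.
 * Returns the index of the chosen definition, or -1 if the link must
 * fail (rule 1) or if there are no definitions at all. */
int resolve(const Strength *defs, size_t n)
{
    int chosen = -1;
    size_t nstrong = 0;

    for (size_t i = 0; i < n; i++) {
        if (defs[i] == STRONG) {
            if (++nstrong > 1)
                return -1;        /* rule 1: two strong symbols is an error */
            chosen = (int)i;      /* rule 2: strong overrides any weak      */
        } else if (chosen < 0) {
            chosen = (int)i;      /* rule 3: any weak will do; take the first */
        }
    }
    return chosen;
}
```

Note that the function never reports anything when a weak definition is overridden, which mirrors the silent behavior that makes these bugs hard to spot.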


For example, suppose we attempt to compile and link the following two C modules: 


/* foo1.c */
int main()
{
    return 0;
}

/* bar1.c */
int main()
{
    return 0;
}


In this case, the linker will generate an error message because the strong symbol
main is defined multiple times (rule 1):


linux> gcc foo1.c bar1.c
/tmp/ccq2Uxnd.o: In function 'main':
bar1.c:(.text+0x0): multiple definition of 'main'


Similarly, the linker will generate an error message for the following modules 
because the strong symbol x is defined twice (rule 1):


/* foo2.c */
int x = 15213;

int main()
{
    return 0;
}

/* bar2.c */
int x = 15213;

void f()
{
}


However, if x is uninitialized in one module, then the linker will quietly choose 
the strong symbol defined in the other (rule 2): 


/* foo3.c */
#include <stdio.h>
void f(void);

int x = 15213;

int main()
{
    f();
    printf("x = %d\n", x);
    return 0;
}

/* bar3.c */
int x;

void f()
{
    x = 15212;
}


At run time, function f changes the value of x from 15213 to 15212, which
might come as an unwelcome surprise to the author of function main! Notice that
the linker normally gives no indication that it has detected multiple definitions
of x:


linux> gcc -o foobar3 foo3.c bar3.c
linux> ./foobar3
x = 15212


The same thing can happen if there are two weak definitions of x (rule 3):


/* foo4.c */
#include <stdio.h>
void f(void);

int x;

int main()
{
    x = 15213;
    f();
    printf("x = %d\n", x);
    return 0;
}

/* bar4.c */
int x;

void f()
{
    x = 15212;
}


The application of rules 2 and 3 can introduce some insidious run-time bugs
that are incomprehensible to the unwary programmer, especially if the duplicate
symbol definitions have different types. Consider the following example, in which
x is inadvertently defined as an int in one module and a double in another:


1   /* foo5.c */
2   #include <stdio.h>
3   void f(void);
4
5   int y = 15212;
6   int x = 15213;
7
8   int main()
9   {
10      f();
11      printf("x = 0x%x y = 0x%x \n",
12             x, y);
13      return 0;
14  }

1   /* bar5.c */
2   double x;
3
4   void f()
5   {
6       x = -0.0;
7   }


On an x86-64/Linux machine, doubles are 8 bytes and ints are 4 bytes. On 
our system, the address of x is 0x601020 and the address of y is 0x601024. Thus, 
the assignment x = -0.0 in line 6 of bar5.c will overwrite the memory locations 
for x and y (lines 5 and 6 in foo5.c) with the double-precision floating-point 
representation of negative zero! 


linux> gcc -Wall -Og -o foobar5 foo5.c bar5.c
/usr/bin/ld: Warning: alignment 4 of symbol 'x' in /tmp/cclUFK5g.o
is smaller than 8 in /tmp/ccbTLcb9.o
linux> ./foobar5
x = 0x0 y = 0x80000000


This is a subtle and nasty bug, especially because it triggers only a warning from 
the linker, and because it typically manifests itself much later in the execution 
of the program, far away from where the error occurred. In a large system with 
hundreds of modules, a bug of this kind is extremely hard to fix, especially because 
many programmers are not aware of how linkers work, and because they often 
ignore compiler warnings. When in doubt, invoke the linker with a flag such
as the GCC -fno-common flag, which triggers an error if it encounters multiply-
defined global symbols. Or use the -Werror option, which turns all warnings into
errors.
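
The overwrite in the foo5.c and bar5.c example can be reproduced inside a single file, which makes the resulting bit patterns easy to check. The sketch below is our own stand-in, not part of the original program: it places two 4-byte integers side by side in a struct (the names Pair and store_double_at_x are hypothetical) and copies an 8-byte double over them. It assumes a little-endian machine such as x86-64, a 4-byte int32_t pair with no padding, and an 8-byte IEEE double.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the adjacent 4-byte ints x and y from foo5.c. */
typedef struct {
    int32_t x;   /* lower address  */
    int32_t y;   /* next 4 bytes   */
} Pair;

/* Mimic what f() in bar5.c effectively does: store an 8-byte double
 * starting at the address of x, so the upper half spills into y. */
void store_double_at_x(Pair *p, double value)
{
    memcpy(p, &value, sizeof value);
}
```

Negative zero has only its sign bit set, so on a little-endian machine the low-order 4 bytes (all zero) land in x and the high-order 4 bytes (0x80000000) land in y, matching the output shown above.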

In Section 7.5, we saw how the compiler assigns symbols to COMMON and
.bss using a seemingly arbitrary convention. Actually, this convention is due to
the fact that in some cases the linker allows multiple modules to define global
symbols with the same name. When the compiler is translating some module and
encounters a weak global symbol, say, x, it does not know if other modules also
define x, and if so, it cannot predict which of the multiple instances of x the linker
might choose. So the compiler defers the decision to the linker by assigning x to
COMMON. On the other hand, if x is initialized to zero, then it is a strong symbol
(and thus must be unique by rule 2), so the compiler can confidently assign it to
.bss. Similarly, static symbols are unique by construction, so the compiler can
confidently assign them to either .data or .bss.


x.k) denote that the linker will associate an 


definitions (rule 3), write *UNKNOWN". 


A. /* Module 1 */          /* Module 2 */
   int main()              int main;
   {                       int p2()
   }                       {
                           }

   (a) REF(main.1) -> DEF(_____._)
   (b) REF(main.2) -> DEF(_____._)


B. /* Module 1 */          /* Module 2 */
   void main()             int main = 1;
   {                       int p2()
   }                       {
                           }

   (a) REF(main.1) -> DEF(_____._)
   (b) REF(main.2) -> DEF(_____._)


C. /* Module 1 */          /* Module 2 */
   int x;                  double x = 1.0;
   void main()             int p2()
   {                       {
   }                       }

   (a) REF(x.1) -> DEF(_____._)
   (b) REF(x.2) -> DEF(_____._)

7.6.2 Linking with Static Libraries 


So far, we have assumed that the linker reads a collection of relocatable object files
and links them together into an output executable file. In practice, all compilation
systems provide a mechanism for packaging related object modules into a single
file called a static library, which can then be supplied as input to the linker. When
it builds the output executable, the linker copies only the object modules in the
library that are referenced by the application program.

Why do systems support the notion of libraries? Consider ISO C99, which
defines an extensive collection of standard I/O, string manipulation, and integer
math functions such as atoi, printf, scanf, strcpy, and rand. They are available
to every C program in the libc.a library. ISO C99 also defines an extensive
collection of floating-point math functions such as sin, cos, and sqrt in the libm.a
library.

Consider the different approaches that compiler developers might use to provide
these functions to users without the benefit of static libraries. One approach
would be to have the compiler recognize calls to the standard functions and to
generate the appropriate code directly. Pascal, which provides a small set of standard
functions, takes this approach, but it is not feasible for C, because of the large
number of standard functions defined by the C standard. It would add significant
complexity to the compiler and would require a new compiler version each time a
function was added, deleted, or modified. To application programmers, however,
this approach would be quite convenient because the standard functions would
always be available.

Another approach would be to put all of the standard C functions in a single
relocatable object module, say, libc.o, that application programmers could link
into their executables:


linux> gcc main.c /usr/lib/libc.o


This approach has the advantage that it would decouple the implementation of the
standard functions from the implementation of the compiler, and would still be
reasonably convenient for programmers. However, a big disadvantage is that every
executable file in a system would now contain a complete copy of the collection
of standard functions, which would be extremely wasteful of disk space. (On our
system, libc.a is about 5 MB and libm.a is about 2 MB.) Worse, each running
program would now contain its own copy of these functions in memory, which
would be extremely wasteful of memory. Another big disadvantage is that any
change to any standard function, no matter how small, would require the library
developer to recompile the entire source file, a time-consuming operation that
would complicate the development and maintenance of the standard functions.

We could address some of these problems by creating a separate relocatable
file for each standard function and storing them in a well-known directory. However,
this approach would require application programmers to explicitly link the
appropriate object modules into their executables, a process that would be error
prone and time consuming:


linux> gcc main.c /usr/lib/printf.o /usr/lib/scanf.o ... 


The notion of a static library was developed to resolve the disadvantages of 
these various approaches. Related functions can be compiled into separate object 
modules and then packaged in a single static library file. Application programs
can then use any of the functions defined in the library by specifying a single
filename on the command line. For example, a program that uses functions from
the C standard library and the math library could be compiled and linked with a
command of the form


linux> gcc main.c /usr/lib/libm.a /usr/lib/libc.a


(a) addvec.o
code/link/addvec.c

int addcnt = 0;

void addvec(int *x, int *y,
            int *z, int n)
{
    int i;

    addcnt++;

    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

code/link/addvec.c

(b) multvec.o
code/link/multvec.c

int multcnt = 0;

void multvec(int *x, int *y,
             int *z, int n)
{
    int i;

    multcnt++;

    for (i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}

code/link/multvec.c

Figure 7.6 Member object files in the libvector library.


At link time, the linker will only copy the object modules that are referenced
by the program, which reduces the size of the executable on disk and in memory. 
On the other hand, the application programmer only needs to include the names 
of a few library files. (In fact, C compiler drivers always pass libc.a to the linker, 
so the reference to libc.a mentioned previously is unnecessary.) 

On Linux systems, static libraries are stored on disk in a particular file format 
known as an archive. An archive is a collection of concatenated relocatable object 
files, with a header that describes the size and location of each member object file. 
Archive filenames are denoted with the .a suffix. 

To make our discussion of libraries concrete, consider the pair of vector
routines in Figure 7.6. Each routine, defined in its own object module, performs a
vector operation on a pair of input vectors and stores the result in an output vector.
As a side effect, each routine records the number of times it has been called by
incrementing a global variable. (This will be useful when we explain the idea of
position-independent code in Section 7.12.)

To create a static library of these functions, we would use the AR tool as follows:


linux> gcc -c addvec.c multvec.c
linux> ar rcs libvector.a addvec.o multvec.o


To use the library, we might write an application such as main2.c in Figure 7.7,
which invokes the addvec library routine. The include (or header) file vector.h
defines the function prototypes for the routines in libvector.a.

To build the executable, we would compile and link the input files main2.o
and libvector.a:




linux> gcc -c main2.c
linux> gcc -static -o prog2c main2.o ./libvector.a


or equivalently,


linux> gcc -c main2.c
linux> gcc -static -o prog2c main2.o -L. -lvector


code/link/main2.c

#include <stdio.h>
#include "vector.h"

int x[2] = {1, 2};
int y[2] = {3, 4};
int z[2];

int main()
{
    addvec(x, y, z, 2);
    printf("z = [%d %d]\n", z[0], z[1]);
    return 0;
}

code/link/main2.c

Figure 7.7 Example program 2. This program invokes a function in the libvector
library.


Figure 7.8 Linking with static libraries. [Diagram: the source files main2.c and vector.h pass through the translators (cpp, cc1, as) to produce the relocatable object file main2.o. The linker combines main2.o with addvec.o from the static library libvector.a and with printf.o and any other modules called by printf.o from the static library libc.a, producing the fully linked executable object file prog2c.]



Figure 7.8 summarizes the activity of the linker. The -static argument tells the 
compiler driver that the linker should build a fully linked executable object file 
that can be loaded into memory and run without any further linking at load time. 
The -lvector argument is a shorthand for libvector.a, and the -L. argument
tells the linker to look for libvector.a in the current directory.

When the linker runs, it determines that the addvec symbol defined by
addvec.o is referenced by main2.o, so it copies addvec.o into the executable.
Since the program doesn't reference any symbols defined by multvec.o, the linker
does not copy this module into the executable. The linker also copies the printf.o
module from libc.a, along with a number of other modules from the C run-time
system.








7.6.3 How Linkers Use Static Libraries to Resolve References 






While static libraries are useful, they are also a source of confusion to programmers
because of the way the Linux linker uses them to resolve external references.
During the symbol resolution phase, the linker scans the relocatable object files
and archives left to right in the same sequential order that they appear on the
compiler driver's command line. (The driver automatically translates any .c files
on the command line into .o files.) During this scan, the linker maintains a set E
of relocatable object files that will be merged to form the executable, a set U of
unresolved symbols (i.e., symbols referred to but not yet defined), and a set D of
symbols that have been defined in previous input files. Initially, E, U, and D are
empty.


















* For each input file f on the command line, the linker determines if f is an
  object file or an archive. If f is an object file, the linker adds f to E, updates
  U and D to reflect the symbol definitions and references in f, and proceeds
  to the next input file.

* If f is an archive, the linker attempts to match the unresolved symbols in U
  against the symbols defined by the members of the archive. If some archive
  member m defines a symbol that resolves a reference in U, then m is added
  to E, and the linker updates U and D to reflect the symbol definitions and
  references in m. This process iterates over the member object files in the
  archive until a fixed point is reached where U and D no longer change. At
  this point, any member object files not contained in E are simply discarded
  and the linker proceeds to the next input file.

* If U is nonempty when the linker finishes scanning the input files on the
  command line, it prints an error and terminates. Otherwise, it merges and
  relocates the object files in E to build the output executable file.
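As a rough illustration of the scan just described, the following C sketch models
E, U, and D as small in-memory sets and replays the algorithm on a toy description
of main2.o and libvector.a. The type and function names (SymSet, Module, InputFile,
LinkState, resolve_symbols) are hypothetical and exist only for this sketch. A real
linker reads symbol tables from object files and handles details, such as duplicate
definitions, that are ignored here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 32   /* capacity of U and D (arbitrary for this sketch) */
#define MAX_MODS 32   /* capacity of E */
#define MAX_LIST 4    /* symbols listed per module, NULL-terminated */

typedef struct { const char *names[MAX_SYMS]; size_t n; } SymSet;

typedef struct {                 /* one relocatable object module */
    const char *name;
    const char *defs[MAX_LIST];  /* symbols the module defines */
    const char *refs[MAX_LIST];  /* symbols the module references */
} Module;

typedef struct {                 /* one input file on the command line */
    bool is_archive;             /* false: a .o file, true: a .a archive */
    const Module *members;       /* a .o file has exactly one member */
    size_t n_members;
} InputFile;

typedef struct {                 /* linker state during the scan */
    const Module *E[MAX_MODS];   /* modules that will form the executable */
    size_t nE;
    SymSet U, D;                 /* unresolved and defined symbols */
} LinkState;

static int set_find(const SymSet *s, const char *name) {
    for (size_t i = 0; i < s->n; i++)
        if (strcmp(s->names[i], name) == 0)
            return (int)i;
    return -1;
}

static void set_add(SymSet *s, const char *name) {
    if (set_find(s, name) < 0 && s->n < MAX_SYMS)
        s->names[s->n++] = name;
}

static void set_remove(SymSet *s, const char *name) {
    int i = set_find(s, name);
    if (i >= 0)
        s->names[i] = s->names[--s->n];
}

/* Add m to E, then update D and U from its definitions and references. */
static void add_module(LinkState *st, const Module *m) {
    if (st->nE < MAX_MODS)
        st->E[st->nE++] = m;
    for (size_t i = 0; i < MAX_LIST && m->defs[i]; i++) {
        set_add(&st->D, m->defs[i]);
        set_remove(&st->U, m->defs[i]);
    }
    for (size_t i = 0; i < MAX_LIST && m->refs[i]; i++)
        if (set_find(&st->D, m->refs[i]) < 0)
            set_add(&st->U, m->refs[i]);
}

/* Does m define a symbol that is currently unresolved? */
static bool resolves_reference(const LinkState *st, const Module *m) {
    for (size_t i = 0; i < MAX_LIST && m->defs[i]; i++)
        if (set_find(&st->U, m->defs[i]) >= 0)
            return true;
    return false;
}

/* Scan the inputs left to right; return the number of symbols left in U. */
size_t resolve_symbols(LinkState *st, const InputFile *files, size_t n) {
    memset(st, 0, sizeof *st);                /* E, U, and D start empty */
    for (size_t f = 0; f < n; f++) {
        const InputFile *in = &files[f];
        if (!in->is_archive) {                /* object file: always added */
            add_module(st, &in->members[0]);
            continue;
        }
        bool changed = true;                  /* archive: iterate to a fixed point */
        while (changed) {
            changed = false;
            for (size_t i = 0; i < in->n_members; i++) {
                /* Members already in E have all their definitions in D,
                   so they can never match a symbol in U again. */
                if (resolves_reference(st, &in->members[i])) {
                    add_module(st, &in->members[i]);
                    changed = true;
                }
            }
        }
    }
    return st->U.n;
}

/* Toy inputs modeled on main2.c and libvector.a (printf is left out). */
static const Module main2_o = { "main2.o", { "main" }, { "addvec" } };
static const Module libvector_a[] = {
    { "addvec.o",  { "addvec" },  { NULL } },
    { "multvec.o", { "multvec" }, { NULL } },
};
```

Scanning main2.o before libvector.a leaves U empty with two modules in E (main2.o
and addvec.o), while scanning the archive first leaves addvec in U, so the order of
the inputs alone decides whether this toy link succeeds.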
























Unfortunately, this algorithm can result in some baffling link-time errors 
because the ordering of libraries and object files on the command line is significant. 
If the library that defines a symbol appears on the command line before the object 
file that references that symbol, then the reference will not be resolved and linking 
will fail. For example, consider the following: 
linux> gcc -static ./libvector.a main2.c
/tmp/cc9XH6Rp.o: In function 'main':
/tmp/cc9XH6Rp.o(.text+0x18): undefined reference to 'addvec'

What happened? When libvector.a is processed, U is empty, so no member
object files from libvector.a are added to E. Thus, the reference to addvec is
never resolved and the linker emits an error message and terminates.

The general rule for libraries is to place them at the end of the command
line. If the members of the different libraries are independent, in that no member
references a symbol defined by another member, then the libraries can be placed
at the end of the command line in any order. If, on the other hand, the libraries
are not independent, then they must be ordered so that for each symbol s that
is referenced externally by a member of an archive, at least one definition of s
follows a reference to s on the command line. For example, suppose foo.c calls
functions in libx.a and libz.a that call functions in liby.a. Then libx.a and
libz.a must precede liby.a on the command line:


linux> gcc foo.c libx.a libz.a liby.a


Libraries can be repeated on the command line if necessary to satisfy the
dependence requirements. For example, suppose foo.c calls a function in libx.a
that calls a function in liby.a that calls a function in libx.a. Then libx.a must
be repeated on the command line:


linux> gcc foo.c libx.a liby.a libx.a


Alternatively, we could combine libx.a and liby.a into a single archive.


Let a and b denote object modules or static libraries in the current directory, and
let a → b denote that a depends on b, in the sense that b defines a symbol that is
referenced by a. For each of the following scenarios, show the minimal command
line (i.e., one with the least number of object file and library arguments) that will
allow the static linker to resolve all symbol references.


A. p.o → libx.a
B. p.o → libx.a → liby.a
C. p.o → libx.a → liby.a and liby.a → libx.a → p.o


7.7 Relocation 


Once the linker has completed the symbol resolution step, it has associated each 
symbol reference in the code with exactly one symbol definition (i.e., a symbol 
table entry in one of its input object modules). At this point, the linker knows 
the exact sizes of the code and data sections in its input object modules. It is now 
ready to begin the relocation step, where it merges the input modules and assigns 
run-time addresses to each symbol. Relocation consists of two steps: 


1. Relocating sections and symbol definitions. In this step, the linker merges all
   sections of the same type into a new aggregate section of the same type. For
   example, the .data sections from the input modules are all merged into one
   section that will become the .data section for the output executable object
   file. The linker then assigns run-time memory addresses to the new aggregate
   sections, to each section defined by the input modules, and to each symbol
   defined by the input modules. When this step is complete, each instruction
   and global variable in the program has a unique run-time memory address.

2. Relocating symbol references within sections. In this step, the linker modifies
   every symbol reference in the bodies of the code and data sections so that
   they point to the correct run-time addresses. To perform this step, the linker
   relies on data structures in the relocatable object modules known as relocation
   entries, which we describe next.








7.7.1 Relocation Entries 


When an assembler generates an object module, it does not know where the code
and data will ultimately be stored in memory. Nor does it know the locations of
any externally defined functions or global variables that are referenced by the
module. So whenever the assembler encounters a reference to an object whose
ultimate location is unknown, it generates a relocation entry that tells the linker
how to modify the reference when it merges the object file into an executable.
Relocation entries for code are placed in .rel.text. Relocation entries for data
are placed in .rel.data.

Figure 7.9 shows the format of an ELF relocation entry. The offset is the
section offset of the reference that will need to be modified. The symbol identifies
the symbol that the modified reference should point to. The type tells the linker
how to modify the new reference. The addend is a signed constant that is used by
some types of relocations to bias the value of the modified reference.
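In the 64-bit ELF files that Linux tools read, the symbol index and the relocation
type share a single 64-bit info word, with the symbol index in the upper 32 bits
and the type in the lower 32 bits. Figure 7.9 expresses the same layout with two
32-bit bit fields. The sketch below shows that packing. The names RelaEntry,
rela_info, rela_sym, and rela_type are made up for this illustration and are not
taken from any system header.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit ELF relocation entry with addend. */
typedef struct {
    uint64_t offset;   /* section offset of the reference to relocate */
    uint64_t info;     /* symbol table index (high 32 bits), type (low 32 bits) */
    int64_t  addend;   /* constant part of the relocation expression */
} RelaEntry;

/* Pack a symbol table index and a relocation type into one info word. */
uint64_t rela_info(uint32_t symbol, uint32_t type)
{
    return ((uint64_t)symbol << 32) | type;
}

/* Recover the symbol table index from an info word. */
uint32_t rela_sym(uint64_t info)
{
    return (uint32_t)(info >> 32);
}

/* Recover the relocation type from an info word. */
uint32_t rela_type(uint64_t info)
{
    return (uint32_t)(info & 0xffffffffu);
}
```

For instance, symbol index 7 with relocation type 2 packs into the info word
0x0000000700000002, and the two accessors recover 7 and 2 from it.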

ELF defines 32 different relocation types, many quite arcane. We are con- 
cerned with only the two most basic relocation types: 


R_X86_64_PC32. Relocate a reference that uses a 32-bit PC-relative address.
Recall from Section 3.6.3 that a PC-relative address is an offset from
the current run-time value of the program counter (PC). When the CPU
executes an instruction using PC-relative addressing, it forms the effective
address (e.g., the target of the call instruction) by adding the 32-bit value
encoded in the instruction to the current run-time value of the PC, which
is always the address of the next instruction in memory.


R_X86_64_32. Relocate a reference that uses a 32-bit absolute address. With
absolute addressing, the CPU directly uses the 32-bit value encoded in
the instruction as the effective address, without further modifications.


code/link/elfstructs.c

typedef struct {
    long offset;    /* Offset of the reference to relocate */
    long type:32,   /* Relocation type */
         symbol:32; /* Symbol table index */
    long addend;    /* Constant part of relocation expression */
} Elf64_Rela;

code/link/elfstructs.c


Figure 7.9 ELF relocation entry. Each entry identifies a reference that must be relocated
and specifies how to compute the modified reference.










These two relocation types support the x86-64 small code model, which assumes
that the total size of the code and data in the executable object file is smaller
than 2 GB, and thus can be accessed at run time using 32-bit PC-relative addresses.
The small code model is the default for gcc. Programs larger than 2 GB can be
compiled using the -mcmodel=medium (medium code model) and -mcmodel=large
(large code model) flags, but we won't discuss those.














7.7.2 Relocating Symbol References 


Figure 7.10 shows the pseudocode for the linker's relocation algorithm. Lines 1
and 2 iterate over each section s and each relocation entry r associated with each
section. For concreteness, assume that each section s is an array of bytes and that
each relocation entry r is a struct of type Elf64_Rela, as defined in Figure 7.9.
Also, assume that when the algorithm runs, the linker has already chosen run-time
addresses for each section (denoted ADDR(s)) and each symbol (denoted
ADDR(r.symbol)). Line 3 computes the address in the s array of the 4-byte
reference that needs to be relocated. If this reference uses PC-relative addressing,
then it is relocated by lines 5-9. If the reference uses absolute addressing, then it
is relocated by lines 11-13.


1   foreach section s {
2       foreach relocation entry r {
3           refptr = s + r.offset;  /* ptr to reference to be relocated */
4
5           /* Relocate a PC-relative reference */
6           if (r.type == R_X86_64_PC32) {
7               refaddr = ADDR(s) + r.offset;  /* ref's run-time address */
8               *refptr = (unsigned) (ADDR(r.symbol) + r.addend - refaddr);
9           }
10
11          /* Relocate an absolute reference */
12          if (r.type == R_X86_64_32)
13              *refptr = (unsigned) (ADDR(r.symbol) + r.addend);
14      }
15  }

Figure 7.10 Relocation algorithm. 
code/link/main-relo.d

1   0000000000000000 <main>:
2      0:   48 83 ec 08          sub    $0x8,%rsp
3      4:   be 02 00 00 00       mov    $0x2,%esi
4      9:   bf 00 00 00 00       mov    $0x0,%edi          %edi = &array
5                   a: R_X86_64_32 array                   Relocation entry
6      e:   e8 00 00 00 00       callq  13 <main+0x13>     sum()
7                   f: R_X86_64_PC32 sum-0x4               Relocation entry
8     13:   48 83 c4 08          add    $0x8,%rsp
9     17:   c3                   retq

code/link/main-relo.d


Figure 7.11 Code and relocation entries from main.o. The original C code is in Figure 7.1.


Let's see how the linker uses this algorithm to relocate the references in our
example program in Figure 7.1. Figure 7.11 shows the disassembled code from
main.o, as generated by the GNU objdump tool (objdump -dx main.o).

The main function references two global symbols, array and sum. For each
reference, the assembler has generated a relocation entry, which is displayed on
the following line (see footnote 2). The relocation entries tell the linker that the
reference to sum should be relocated using a 32-bit PC-relative address, and the
reference to array should be relocated using a 32-bit absolute address. The next
two sections detail how the linker relocates these references.


Relocating PC-Relative References 


In line 6 in Figure 7.11, function main calls the sum function, which is defined in
module sum.o. The call instruction begins at section offset 0xe and consists of the
1-byte opcode 0xe8, followed by a placeholder for the 32-bit PC-relative reference
to the target sum.

The corresponding relocation entry r consists of four fields:


r.offset = 0xf
r.symbol = sum
r.type   = R_X86_64_PC32
r.addend = -4


These fields tell the linker to modify the 32-bit PC-relative reference starting at
offset 0xf so that it will point to the sum routine at run time. Now, suppose that
the linker has determined that


ADDR(s) = ADDR(.text) = 0x4004d0

and

ADDR(r.symbol) = ADDR(sum) = 0x4004e8


2. Recall that relocation entries and instructions are actually stored in different sections of the object
file. The objdump tool displays them together for convenience.


Using the algorithm in Figure 7.10, the linker first computes the run-time
address of the reference (line 7):


refaddr = ADDR(s) + r.offset
        = 0x4004d0 + 0xf
        = 0x4004df


It then updates the reference so that it will point to the sum routine at run time
(line 8):


*refptr = (unsigned) (ADDR(r.symbol) + r.addend - refaddr)
        = (unsigned) (0x4004e8 + (-4) - 0x4004df)
        = (unsigned) (0x5)


In the resulting executable object file, the call instruction has the following
relocated form:


4004de:  e8 05 00 00 00    callq  4004e8 <sum>    sum()


At run time, the call instruction will be located at address 0x4004de. When
the CPU executes the call instruction, the PC has a value of 0x4004e3, which
is the address of the instruction immediately following the call instruction. To
execute the call instruction, the CPU performs the following steps:


1. Push PC onto stack
2. PC ← PC + 0x5 = 0x4004e3 + 0x5 = 0x4004e8


Thus, the next instruction to execute is the first instruction of the sum routine, 
which of course is what we want! 
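The addend of -4 can look arbitrary, so here is a small sketch of both sides of the
calculation. The 4-byte reference starts 1 byte into the call instruction, while the
PC that the CPU adds to it is the address of the next instruction, 4 bytes past the
start of the reference. The function names call_target and call_disp, and the
constant CALL_LEN, are hypothetical helpers for this sketch that model only the
5-byte call form used in the example.

```c
#include <stdint.h>

#define CALL_LEN 5   /* 1-byte opcode 0xe8 plus a 4-byte displacement */

/* CPU side: where a call at call_addr with displacement disp transfers control. */
uint64_t call_target(uint64_t call_addr, int32_t disp)
{
    uint64_t next_pc = call_addr + CALL_LEN;   /* PC while the call executes */
    return next_pc + (uint64_t)(int64_t)disp;  /* sign-extend, then add */
}

/* Linker side: the displacement to store so that the call reaches target. */
int32_t call_disp(uint64_t call_addr, uint64_t target)
{
    uint64_t refaddr = call_addr + 1;          /* displacement follows the opcode */
    return (int32_t)(target - 4 - refaddr);    /* target + addend - refaddr */
}
```

With the call at 0x4004de and sum at 0x4004e8, call_disp yields 0x5, and feeding that
displacement back into call_target returns 0x4004e8, matching the two steps above.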


Relocating Absolute References 


Relocating absolute references is straightforward. For example, in line 4 in
Figure 7.11, the mov instruction copies the address of array (a 32-bit immediate
value) into register %edi. The mov instruction begins at section offset 0x9 and
consists of the 1-byte opcode 0xbf, followed by a placeholder for the 32-bit
absolute reference to array.

The corresponding relocation entry r consists of four fields:


r.offset = 0xa
r.symbol = array
r.type   = R_X86_64_32
r.addend = 0


These fields tell the linker to modify the absolute reference starting at offset 0xa
so that it will point to the first byte of array at run time. Now, suppose that the
linker has determined that


ADDR(r.symbol) = ADDR(array) = 0x601018


The linker updates the reference using line 13 of the algorithm in Figure 7.10:


*refptr = (unsigned) (ADDR(r.symbol) + r.addend)
        = (unsigned) (0x601018 + 0)
        = (unsigned) (0x601018)


In the resulting executable object file, the reference has the following relocated
form:


4004d9:  bf 18 10 60 00    mov  $0x601018,%edi    %edi = &array


(a) Relocated .text section

1   00000000004004d0 <main>:
2     4004d0:  48 83 ec 08       sub    $0x8,%rsp
3     4004d4:  be 02 00 00 00    mov    $0x2,%esi
4     4004d9:  bf 18 10 60 00    mov    $0x601018,%edi     %edi = &array
5     4004de:  e8 05 00 00 00    callq  4004e8 <sum>       sum()
6     4004e3:  48 83 c4 08       add    $0x8,%rsp
7     4004e7:  c3                retq

8   00000000004004e8 <sum>:
9     4004e8:  b8 00 00 00 00    mov    $0x0,%eax
10    4004ed:  ba 00 00 00 00    mov    $0x0,%edx
11    4004f2:  eb 09             jmp    4004fd <sum+0x15>
12    4004f4:  48 63 ca          movslq %edx,%rcx
13    4004f7:  03 04 8f          add    (%rdi,%rcx,4),%eax
14    4004fa:  83 c2 01          add    $0x1,%edx
15    4004fd:  39 f2             cmp    %esi,%edx
16    4004ff:  7c f3             jl     4004f4 <sum+0xc>
17    400501:  f3 c3             repz retq

(b) Relocated .data section

1   0000000000601018 <array>:
2     601018:  01 00 00 00 02 00 00 00


Figure 7.12 Relocated .text and .data sections for the executable file prog. The original C code is in
Figure 7.1.


Putting it all together, Figure 7.12 shows the relocated .text and .data sections
in the final executable object file. At load time, the loader can copy the bytes
from these sections directly into memory and execute the instructions without
any further modifications.
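To make the byte-level effect concrete, the sketch below applies the two relocation
entries from Figure 7.11 to the bytes of main's .text section. It is a simplified
counterpart to the pseudocode in Figure 7.10, under several assumptions: the Reloc
type carries the already-resolved symbol address instead of a symbol table index,
the constants RELOC_PC32 and RELOC_ABS32 are stand-ins for the real relocation
type codes, and all names here are hypothetical.

```c
#include <stdint.h>

enum { RELOC_PC32, RELOC_ABS32 };   /* stand-ins for R_X86_64_PC32, R_X86_64_32 */

typedef struct {
    uint64_t offset;     /* r.offset: section offset of the 4-byte reference */
    int      type;       /* RELOC_PC32 or RELOC_ABS32 */
    uint64_t sym_addr;   /* ADDR(r.symbol), already chosen by the linker */
    int64_t  addend;     /* r.addend */
} Reloc;

/* Patch one 4-byte reference in section bytes s, whose run-time address is sec_addr. */
void relocate(uint8_t *s, uint64_t sec_addr, const Reloc *r)
{
    uint64_t value = r->sym_addr + (uint64_t)r->addend;
    if (r->type == RELOC_PC32)
        value -= sec_addr + r->offset;       /* subtract refaddr */
    /* Keep the low 32 bits and store them least significant byte first. */
    for (int i = 0; i < 4; i++)
        s[r->offset + i] = (uint8_t)(value >> (8 * i));
}

/* .text bytes of main.o before relocation (Figure 7.11). */
static const uint8_t main_text[24] = {
    0x48, 0x83, 0xec, 0x08,              /* sub   $0x8,%rsp */
    0xbe, 0x02, 0x00, 0x00, 0x00,        /* mov   $0x2,%esi */
    0xbf, 0x00, 0x00, 0x00, 0x00,        /* mov   $0x0,%edi  (placeholder) */
    0xe8, 0x00, 0x00, 0x00, 0x00,        /* callq            (placeholder) */
    0x48, 0x83, 0xc4, 0x08,              /* add   $0x8,%rsp */
    0xc3                                 /* retq */
};

/* The two relocation entries, with the addresses assumed in the text. */
static const Reloc main_relocs[2] = {
    { 0xa, RELOC_ABS32, 0x601018, 0 },   /* reference to array */
    { 0xf, RELOC_PC32,  0x4004e8, -4 },  /* reference to sum */
};
```

Copying main_text into a writable buffer and calling relocate on it with section
address 0x4004d0 for each entry turns the mov placeholder into 18 10 60 00 and the
call placeholder into 05 00 00 00, the same bytes that appear in Figure 7.12(a).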


Practice Problem 7.4 (solution page 718)

This problem concerns the relocated program in Figure 7.12(a).

A. What is the hex address of the relocated reference to sum in line 5?
B. What is the hex value of the relocated reference to sum in line 5?


Practice Problem 7.5 (solution page 718)

Consider the call to function swap in object file m.o (Figure 7.5).

9:   e8 00 00 00 00    callq  e <main+0xe>    swap()


with the following relocation entry:


r.offset = 0xa
r.symbol = swap
r.type   = R_X86_64_PC32
r.addend = -4


Now suppose that the linker relocates .text in m.o to address 0x4004d0 and swap
to address 0x4004e8. Then what is the value of the relocated reference to swap in
the callq instruction?


7.8 Executable Object Files 


We have seen how the linker merges multiple object files into a single executable 
object file. Our example C program, which began life as a collection of ASCII 
text files, has been transformed into a single binary file that contains all of the 
information needed to load the program into memory and run it. Figure 7.13
summarizes the kinds of information in a typical ELF executable file. 


[Figure 7.13 diagram, listing the file contents from offset 0: ELF header; segment
header table (maps contiguous file sections to run-time memory segments); read-only
memory segment (code segment); read/write memory segment (data segment); symbol
table and debugging info, which are not loaded into memory; and a final part that
describes the object file sections.]


Figure 7.13 Typical ELF executable object file. 
code/link/prog-exe.d

    Read-only code segment
1   LOAD off    0x0000000000000000 vaddr 0x0000000000400000 paddr 0x0000000000400000 align 2**21
2        filesz 0x000000000000069c memsz 0x000000000000069c flags r-x

    Read/write data segment
3   LOAD off    0x0000000000000df8 vaddr 0x0000000000600df8 paddr 0x0000000000600df8 align 2**21
4        filesz 0x0000000000000228 memsz 0x0000000000000230 flags rw-

code/link/prog-exe.d


Figure 7.14 Program header table for the example executable prog. off: offset in object file; 
vaddr/paddr: memory address; align: alignment requirement; filesz: segment size in object file; 
memsz: segment size in memory; flags: run-time permissions. 


The format of an executable object file is similar to that of a relocatable object 
file. The ELF header describes the overall format of the file. It also includes the 
program’s entry point, which is the address of the first instruction to execute when 
the program runs. The .text, .rodata, and .data sections are similar to those in 
a relocatable object file, except that these sections have been relocated to their 
eventual run-time memory addresses. The .init section defines a small function,
called _init, that will be called by the program’s initialization code. Since the 
executable is fully linked (relocated), it needs no .rel sections. 

ELF executables are designed to be easy to load into memory, with contigu- 
ous chunks of the executable file mapped to contiguous memory segments. This 
mapping is described by the program header table. Figure 7.14 shows part of the 
program header table for our example executable prog, as displayed by objdump.

From the program header table, we see that two memory segments will be 
initialized with the contents of the executable object file. Lines 1 and 2 tell us 
that the first segment (the code segment) has read/execute permissions, starts at 
memory address 0x400000, has a total size in memory of 0x69c bytes, and is 
initialized with the first 0x69c bytes of the executable object file, which includes 
the ELF header, the program header table, and the .init, .text, and .rodata 
sections. 

Lines 3 and 4 tell us that the second segment (the data segment) has read/write
permissions, starts at memory address 0x600df8, has a total memory size of 0x230
bytes, and is initialized with the 0x228 bytes in the .data section starting at offset
0xdf8 in the object file. The remaining 8 bytes in the segment correspond to .bss
data that will be initialized to zero at run time.

For any segment s, the linker must choose a starting address, vaddr, such that 


vaddr mod align = off mod align 


where off is the offset of the segment's first section in the object file, and align
is the alignment specified in the program header (2^21 = 0x200000). For example,
in the data segment in Figure 7.14,


vaddr mod align = 0x600df8 mod 0x200000 = 0xdf8


off mod align = 0xdf8 mod 0x200000 = 0xdf8


This alignment requirement is an optimization that enables segments in the object 
file to be transferred efficiently to memory when the program executes. The reason 
is somewhat subtle and is due to the way that virtual memory is organized as large 
contiguous power-of-2 chunks of bytes. You will learn all about virtual memory in 
Chapter 9. 
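The alignment rule and the .bss arithmetic can be checked mechanically. The sketch
below encodes the two segments from Figure 7.14 and tests the congruence
vaddr mod align = off mod align, along with the number of bytes that exist in memory
but not in the file. The Segment type and the function names are hypothetical and
chosen only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* The fields of one program header table entry that matter here. */
typedef struct {
    uint64_t off;      /* offset of the segment in the object file */
    uint64_t vaddr;    /* starting run-time address */
    uint64_t filesz;   /* segment size in the object file */
    uint64_t memsz;    /* segment size in memory */
    uint64_t align;    /* alignment requirement */
} Segment;

/* True if the segment satisfies vaddr mod align = off mod align. */
bool segment_aligned(const Segment *s)
{
    return s->vaddr % s->align == s->off % s->align;
}

/* Bytes present in memory but not in the file (zero-initialized at run time). */
uint64_t zero_fill_bytes(const Segment *s)
{
    return s->memsz - s->filesz;
}

/* The two segments listed in Figure 7.14. */
static const Segment code_seg = { 0x0,   0x400000, 0x69c, 0x69c, 1u << 21 };
static const Segment data_seg = { 0xdf8, 0x600df8, 0x228, 0x230, 1u << 21 };
```

Both segments satisfy the congruence, and zero_fill_bytes(&data_seg) is 8, which
corresponds to the 8 bytes of .bss data mentioned above.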


7.9 Loading Executable Object Files 


To run an executable object file prog, we can type its name to the Linux shell's
command line:


linux> ./prog


Since prog does not correspond to a built-in shell command, the shell assumes that
prog is an executable object file, which it runs for us by invoking some memory-
resident operating system code known as the loader. Any Linux program can
invoke the loader by calling the execve function, which we will describe in detail in
Section 8.4.6. The loader copies the code and data in the executable object file from
disk into memory and then runs the program by jumping to its first instruction, or
entry point. This process of copying the program into memory and then running
it is known as loading.
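As a preview of the material in Chapter 8, here is a minimal sketch of how a program
such as the shell might invoke the loader: it creates a child process with fork, and
the child calls execve on the executable. The function name load_and_run is
hypothetical, and the sketch leaves out most of the error reporting and signal
handling that a real shell needs.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;   /* the caller's environment, passed on unchanged */

/* Run the executable at path with arguments argv and wait for it to finish.
   Returns the program's exit status, or -1 if it could not be run to completion. */
int load_and_run(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                     /* fork failed */

    if (pid == 0) {
        /* Child: ask the loader to replace this process with the program. */
        execve(path, argv, environ);
        _exit(127);                    /* reached only if execve failed */
    }

    /* Parent: wait for the child to terminate. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass an absolute or relative path along with a NULL-terminated
argument vector whose first entry is conventionally the program name.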

Every running Linux program has a run-time memory image similar to the
one in Figure 7.15. On Linux x86-64 systems, the code segment starts at address
0x400000, followed by the data segment. The run-time heap follows the data
segment and grows upward via calls to the malloc library. (We will describe malloc
and the heap in detail in Section 9.9.) This is followed by a region that is reserved
for shared modules. The user stack starts below the largest legal user address
(2^48 - 1) and grows down, toward smaller memory addresses. The region above
the stack, starting at address 2^48, is reserved for the code and data in the kernel,
which is the memory-resident part of the operating system.

For simplicity, we've drawn the heap, data, and code segments as abutting
each other, and we've placed the top of the stack at the largest legal user address.
In practice, there is a gap between the code and data segments due to the
alignment requirement on the .data segment (Section 7.8). Also, the linker uses
address-space layout randomization (ASLR, Section 3.10.4) when it assigns run-
time addresses to the stack, shared library, and heap segments. Even though the
locations of these regions change each time the program is run, their relative
positions are the same.

When the loader runs, it creates a memory image similar to the one shown
in Figure 7.15. Guided by the program header table, it copies chunks of the
executable object file into the code and data segments. Next, the loader jumps
to the program's entry point, which is always the address of the _start function.
This function is defined in the system object file crt1.o and is the same for all C
programs. The _start function calls the system startup function, __libc_start_main,
which is defined in libc.so. It initializes the execution environment, calls
the user-level main function, handles its return value, and if necessary returns
control to the kernel.

Figure 7.15 Linux x86-64 run-time memory image. Gaps due to segment alignment
requirements and address-space layout randomization (ASLR) are not shown. Not
to scale.


7.10 Dynamic Linking with Shared Libraries 


The static libraries that we studied in Section 7.6.2 address many of the issues
associated with making large collections of related functions available to application
programs. However, static libraries still have some significant disadvantages. Static
libraries, like all software, need to be maintained and updated periodically. If
application programmers want to use the most recent version of a library, they must
somehow become aware that the library has changed and then explicitly relink
their programs against the updated library.

Another issue is that almost every C program uses standard I/O functions such
as printf and scanf. At run time, the code for these functions is duplicated in the
text segment of each running process. On a typical system that is running hundreds
of processes, this can be a significant waste of scarce memory system resources.
(An interesting property of memory is that it is always a scarce resource, regardless
of how much there is in a system. Disk space and kitchen trash cans share this same
property.)


Aside How do loaders really work?

Our description of loading is conceptually correct but intentionally not entirely accurate. To understand
how loading really works, you must understand the concepts of processes, virtual memory, and memory
mapping, which we haven't discussed yet. As we encounter these concepts later in Chapters 8 and 9,
we will revisit loading and gradually reveal the mystery to you.

For the impatient reader, here is a preview of how loading really works: Each program in a Linux
system runs in the context of a process with its own virtual address space. When the shell runs a program,
the parent shell process forks a child process that is a duplicate of the parent. The child process invokes
the loader via the execve system call. The loader deletes the child's existing virtual memory segments
and creates a new set of code, data, heap, and stack segments. The new stack and heap segments are
initialized to zero. The new code and data segments are initialized to the contents of the executable
file by mapping pages in the virtual address space to page-size chunks of the executable file. Finally,
the loader jumps to the _start address, which eventually calls the application's main routine. Aside
from some header information, there is no copying of data from disk to memory during loading. The
copying is deferred until the CPU references a mapped virtual page, at which point the operating system
automatically transfers the page from disk to memory using its paging mechanism.

Shared libraries are modern innovations that address the disadvantages of
static libraries. A shared library is an object module that, at either run time or load
time, can be loaded at an arbitrary memory address and linked with a program in
memory. This process is known as dynamic linking and is performed by a program
called a dynamic linker. Shared libraries are also referred to as shared objects, and
on Linux systems they are indicated by the .so suffix. Microsoft operating systems
make heavy use of shared libraries, which they refer to as DLLs (dynamic link
libraries).

Shared libraries are "shared" in two different ways. First, in any given file
system, there is exactly one .so file for a particular library. The code and data in
this .so file are shared by all of the executable object files that reference the library,
as opposed to the contents of static libraries, which are copied and embedded in
the executables that reference them. Second, a single copy of the .text section of
a shared library in memory can be shared by different running processes. We will
explore this in more detail when we study virtual memory in Chapter 9.

Figure 7.16 summarizes the dynamic linking process for the example program 
in Figure 7.7. To build a shared library libvector.so of our example vector 
routines in Figure 7.6, we invoke the compiler driver with some special directives 
to the compiler and linker: 


linux> gcc -shared -fpic -o libvector.so addvec.c multvec.c


[Figure 7.16: diagram of the dynamic linking process for the example program, not
reproduced here; its inputs include main2.c and vector.h.]

The -fpic flag directs the compiler to generate position-independent code (more
on this in the next section). The -shared flag directs the linker to create a shared
object file. Once we have created the library, we would then link it into our example
program in Figure 7.7:


linux> gcc -o prog2l main2.c ./libvector.so


This creates an executable object file prog2l in a form that can be linked with
libvector.so at run time. The basic idea is to do some of the linking statically
when the executable file is created, and then complete the linking process
dynamically when the program is loaded. It is important to realize that none of the
code or data sections from libvector.so are actually copied into the executable prog2l
at this point. Instead, the linker copies some relocation and symbol table
information that will allow references to code and data in libvector.so to be resolved
at load time.

When the loader loads and runs the executable prog2l, it loads the partially
linked executable prog2l, using the techniques discussed in Section 7.9. Next, it
notices that prog2l contains a .interp section, which contains the path name of
the dynamic linker, which is itself a shared object (e.g., ld-linux.so on Linux
systems). Instead of passing control to the application, as it would normally do,
the loader loads and runs the dynamic linker. The dynamic linker then finishes the
linking task by performing the following relocations:





* Relocating the text and data of libc.so into some memory segment

* Relocating the text and data of libvector.so into another memory segment

* Relocating any references in prog2l to symbols defined by libc.so and
  libvector.so

Finally, the dynamic linker passes control to the application. From this point on,
the locations of the shared libraries are fixed and do not change during execution
of the program.





7.11 Loading and Linking Shared Libraries from Applications 


Up to this point, we have discussed the scenario in which the dynamic linker loads
and links shared libraries when an application is loaded, just before it executes.
However, it is also possible for an application to request the dynamic linker to
load and link arbitrary shared libraries while the application is running, without
having to link the application against those libraries at compile time.

Dynamic linking is a powerful and useful technique. Here are some examples
in the real world:

* Distributing software. Developers of Microsoft Windows applications
  frequently use shared libraries to distribute software updates. They generate
  a new copy of a shared library, which users can then download and use as a
  replacement for the current version. The next time they run their application,
  it will automatically link and load the new shared library.

* Building high-performance Web servers. Many Web servers generate dynamic
  content, such as personalized Web pages, account balances, and banner ads.
  Early Web servers generated dynamic content by using fork and execve
  to create a child process and run a "CGI program" in the context of the
  child. However, modern high-performance Web servers can generate dynamic
  content using a more efficient and sophisticated approach based on dynamic
  linking.

  The idea is to package each function that generates dynamic content in
  a shared library. When a request arrives from a Web browser, the server
  dynamically loads and links the appropriate function and then calls it directly,
  as opposed to using fork and execve to run the function in the context of a
  child process. The function remains cached in the server's address space, so
  subsequent requests can be handled at the cost of a simple function call. This
  can have a significant impact on the throughput of a busy site. Further, existing
  functions can be updated and new functions can be added at run time, without
  stopping the server.


Linux systems provide a simple interface to the dynamic linker that allows 
application programs to load and link shared libraries at run time. 


#include <dlfcn.h> 


void *dlopen(const char *filename, int flag); 
Returns: pointer to handle if OK, NULL on error 
The dlopen function loads and links the shared library filename. The external
symbols in filename are resolved using libraries previously opened with the
RTLD_GLOBAL flag. If the current executable was compiled with the -rdynamic flag,
then its global symbols are also available for symbol resolution. The flag argument
must include either RTLD_NOW, which tells the linker to resolve references to
external symbols immediately, or the RTLD_LAZY flag, which instructs the linker
to defer symbol resolution until code from the library is executed. Either of these
values can be ORed with the RTLD_GLOBAL flag.


#include <dlfcn.h>


void *dlsym(void *handle, char *symbol);
Returns: pointer to symbol if OK, NULL on error


The dlsym function takes a handle to a previously opened shared library and
a symbol name and returns the address of the symbol, if it exists, or NULL
otherwise.


#include <dlfcn.h>


int dlclose(void *handle);
Returns: 0 if OK, -1 on error


The dlclose function unloads the shared library if no other shared libraries are
still using it.


#include <dlfcn.h>


const char *dlerror(void);
Returns: error message if previous call to dlopen, dlsym, or dlclose failed;
NULL if previous call was OK

The dlerror function returns a string describing the most recent error that
occurred as a result of calling dlopen, dlsym, or dlclose, or NULL if no error
occurred.

Figure 7.17 shows how we would use this interface to dynamically link our 
libvector.so shared library at run time and then invoke its addvec routine. To 
compile the program, we would invoke gcc in the following way:


linux> gcc -rdynamic -o prog2r dll.c -ldl

code/link/dll.c

#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

int x[2] = {1, 2};
int y[2] = {3, 4};
int z[2];

int main()
{
    void *handle;
    void (*addvec)(int *, int *, int *, int);
    char *error;

    /* Dynamically load the shared library containing addvec() */
    handle = dlopen("./libvector.so", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "%s\n", dlerror());
        exit(1);
    }

    /* Get a pointer to the addvec() function we just loaded */
    addvec = dlsym(handle, "addvec");
    if ((error = dlerror()) != NULL) {
        fprintf(stderr, "%s\n", error);
        exit(1);
    }

    /* Now we can call addvec() just like any other function */
    addvec(x, y, z, 2);
    printf("z = [%d %d]\n", z[0], z[1]);

    /* Unload the shared library */
    if (dlclose(handle) < 0) {
        fprintf(stderr, "%s\n", dlerror());
        exit(1);
    }
    return 0;
}

code/link/dll.c


Figure 7.17 Example program 3. Dynamically loads and links the shared library
libvector.so at run time.
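
One detail of Figure 7.17 deserves a closer look: the program decides whether dlsym succeeded by calling dlerror, not by testing the returned pointer. The following is a minimal sketch of that idiom under a few assumptions. The helper name load_symbol and its handle_out parameter are hypothetical (they are not part of dlfcn.h), and the extra call to dlerror before dlsym is the convention suggested by the Linux manual pages for clearing any stale error, not something the figure itself does.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not part of <dlfcn.h>): open the shared library
   `path`, look up `name`, and return its address, or NULL on failure.
   On success, *handle_out receives the handle to pass to dlclose later. */
void *load_symbol(const char *path, const char *name, void **handle_out)
{
    void *handle = dlopen(path, RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "%s\n", dlerror());
        return NULL;
    }

    dlerror();                        /* Clear any stale error state */
    void *addr = dlsym(handle, name);
    const char *error = dlerror();    /* Non-NULL only if dlsym failed */
    if (error != NULL) {
        fprintf(stderr, "%s\n", error);
        dlclose(handle);
        return NULL;
    }

    *handle_out = handle;
    return addr;
}
```

With this helper, a call such as load_symbol("./libvector.so", "addvec", &handle) would yield the same pointer that Figure 7.17 stores in addvec, provided the library exists in the current directory. As with the figure, older systems may need -ldl on the link line.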

Aside Shared libraries and the Java Native Interface

Java defines a standard calling convention called Java Native Interface (JNI) that allows “native” C
and C++ functions to be called from Java programs. The basic idea of JNI is to compile the native C
function, say, foo, into a shared library, say, foo.so. When a running Java program attempts to invoke
function foo, the Java interpreter uses the dlopen interface (or something like it) to dynamically link
and load foo.so and then call foo.


7.12 Position-Independent Code (PIC) 


A key purpose of shared libraries is to allow multiple running processes to share 
the same library code in memory and thus save precious memory resources. So 
how can multiple processes share a single copy of a program? One approach would 
be to assign a priori a dedicated chunk of the address space to each shared library, 
and then require the loader to always load the shared library at that address. 
While straightforward, this approach creates some serious problems. It would 
be an inefficient use of the address space because portions of the space would 
be allocated even if a process didn't use the library. It would also be difficult to 
manage. We would have to ensure that none of the chunks overlapped. Each time 
a library was modified, we would have to make sure that it still fit in its assigned 
chunk. If not, then we would have to find a new chunk. And if we created a 
new library, we would have to find room for it. Over time, given the hundreds 
of libraries and versions of libraries in a system, it would be difficult to keep the 
address space from fragmenting into lots of small unused but unusable holes. Even 
worse, the assignment of libraries to memory would be different for each system, 
thus creating even more management headaches. 

To avoid these problems, modern systems compile the code segments of 
shared modules so that they can be loaded anywhere in memory without having to 
be modified by the linker. With this approach, a single copy of a shared module's 
code segment can be shared by an unlimited number of processes. (Of course, each 
process will still get its own copy of the read/write data segment.) 

Code that can be loaded without needing any relocations is known as
position-independent code (PIC). Users direct GNU compilation systems to generate PIC
code with the -fpic option to gcc. Shared libraries must always be compiled with
this option.

On x86-64 systems, references to symbols in the same executable object mod- 
ule require no special treatment to be PIC. These references can be compiled using 
PC-relative addressing and relocated by the static linker when it builds the object 
file. However, references to external procedures and global variables that are de- 
fined by shared modules require some special techniques, which we describe next. 


PIC Data References 


Compilers generate PIC references to global variables by exploiting the following
interesting fact: no matter where we load an object module (including shared
object modules) in memory, the data segment is always the same distance from
the code segment. Thus, the distance between any instruction in the code segment
and any variable in the data segment is a run-time constant, independent of the
absolute memory locations of the code and data segments.

Figure 7.18 Using the GOT to reference a global variable. The addvec routine in
libvector.so references addcnt indirectly through the GOT for libvector.so.

Compilers that want to generate PIC references to global variables exploit
this fact by creating a table called the global offset table (GOT) at the beginning 
of the data segment. The GOT contains an 8-byte entry for each global data 
object (procedure or global variable) that is referenced by the object module. 
The compiler also generates a relocation record for each entry in the GOT. At 
load time, the dynamic linker relocates each GOT entry so that it contains the 
absolute address of the object. Each object module that references global objects 
has its own GOT. 

Figure 7.18 shows the GOT from our example libvector.so shared module. 
The addvec routine loads the address of the global variable addcnt indirectly via 
GOT[3] and then increments addcnt in memory. The key idea here is that the offset
in the PC-relative reference to GOT[3] is a run-time constant. 

Since addcnt is defined by the libvector.so module, the compiler could have
exploited the constant distance between the code and data segments by generating
a direct PC-relative reference to addcnt and adding a relocation for the linker
to resolve when it builds the shared module. However, if addcnt were defined
by another shared module, then the indirect access through the GOT would be
necessary. In this case, the compiler has chosen to use the most general solution,
the GOT, for all references.
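
The indirection through the GOT can be modeled in ordinary C. The sketch below is only an analogy, not what the compiler or dynamic linker literally emits: the array got, the functions relocate_got and bump_addcnt, and the table size are illustrative names and values chosen to mirror the GOT[3] entry described above.

```c
#include <stddef.h>

/* Illustrative model only: these names are not part of any real ABI. */
static int addcnt = 0;       /* The global variable being referenced */
static void *got[4];         /* Stand-in for the module's GOT */

/* Stand-in for the dynamic linker's load-time relocation of GOT[3]:
   store the absolute address of addcnt in the table. */
void relocate_got(void)
{
    got[3] = &addcnt;
}

/* Model of addvec's reference to addcnt: the code never embeds the
   address of addcnt; it loads the address from GOT[3] and then
   increments the variable through that pointer. */
int bump_addcnt(void)
{
    int *p = (int *)got[3];  /* Load the address held in GOT[3] */
    *p += 1;                 /* Increment addcnt in memory */
    return *p;
}
```

After relocate_got() has run once, each call to bump_addcnt() increments addcnt through the table. In the real mechanism the only value baked into the code is the PC-relative offset to the table entry, which is why the code segment can stay unmodified no matter where the module is loaded.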





PIC Function Calls 


Suppose that a program calls a function that is defined by a shared library. The 
compiler has no way of predicting the run-time address of the function, since 
the shared module that defines it could be loaded anywhere at run time. The 
normal approach would be to generate a relocation record for the reference, which 
the dynamic linker could then resolve when the program was loaded. However,
this approach would not be PIC, since it would require the linker to modify the
code segment of the calling module. GNU compilation systems solve this problem
using an interesting technique, called lazy binding, that defers the binding of each
procedure address until the first time the procedure is called.

The motivation for lazy binding is that a typical application program will 
call only a handful of the hundreds or thousands of functions exported by a 
shared library such as libc.so. By deferring the resolution of a function's address 
until it is actually called, the dynamic linker can avoid hundreds or thousands 
of unnecessary relocations at load time. There is a nontrivial run-time overhead 
the first time the function is called, but each call thereafter costs only a single 
instruction and a memory reference for the indirection. 

Lazy binding is implemented with a compact yet somewhat complex interac- 
tion between two data structures: the GOT and the procedure linkage table (PLT). 
If an object module calls any functions that are defined in shared libraries, then it 
has its own GOT and PLT. The GOT is part of the data segment. The PLT is part 
of the code segment. 

Figure 7.19 shows how the PLT and GOT work together to resolve the address 
of a function at run time. First, let’s examine the contents of each of these tables. 



(a) First invocation of addvec

Data segment
  Global offset table (GOT)
    GOT[0]: addr of .dynamic
    GOT[1]: addr of reloc entries
    GOT[2]: addr of dynamic linker
    GOT[3]: 0x4005b6 # sys startup
    GOT[4]: 0x4005c6 # addvec()
    GOT[5]: 0x4005d6 # printf()

Code segment
    callq 0x4005c0 # call addvec()

  Procedure linkage table (PLT)
    # PLT[0]: call dynamic linker
    4005a0: pushq *GOT[1]
    4005a6: jmpq *GOT[2]

    # PLT[2]: call addvec()
    4005c0: jmpq *GOT[4]
    4005c6: pushq $0x1
    4005cb: jmpq 4005a0

(b) Subsequent invocations of addvec

Data segment
  Global offset table (GOT)
    GOT[0]: addr of .dynamic
    GOT[1]: addr of reloc entries
    GOT[2]: addr of dynamic linker
    GOT[3]: 0x4005b6 # sys startup
    GOT[4]: &addvec()
    GOT[5]: 0x4005d6 # printf()

Code segment
    callq 0x4005c0 # call addvec()

  Procedure linkage table (PLT)
    # PLT[0]: call dynamic linker
    4005a0: pushq *GOT[1]
    4005a6: jmpq *GOT[2]

    # PLT[2]: call addvec()
    4005c0: jmpq *GOT[4]
    4005c6: pushq $0x1
    4005cb: jmpq 4005a0

Figure 7.19 Using the PLT and GOT to call external functions. The dynamic linker resolves the address of
addvec the first time it is called.


Procedure linkage table (PLT). The PLT is an array of 16-byte code entries.
PLT[0] is a special entry that jumps into the dynamic linker. Each shared
library function called by the executable has its own PLT entry. Each of
these entries is responsible for invoking a specific function. PLT[1] (not
shown here) invokes the system startup function (__libc_start_main),
which initializes the execution environment, calls the main function, and
handles its return value. Entries starting at PLT[2] invoke functions called
by the user code. In our example, PLT[2] invokes addvec and PLT[3] (not
shown) invokes printf.


Global offset table (GOT). As we have seen, the GOT is an array of 8-byte
address entries. When used in conjunction with the PLT, GOT[0] and
GOT[1] contain information that the dynamic linker uses when it resolves
function addresses. GOT[2] is the entry point for the dynamic linker in
the ld-linux.so module. Each of the remaining entries corresponds to
a called function whose address needs to be resolved at run time. Each
has a matching PLT entry. For example, GOT[4] and PLT[2] correspond
to addvec. Initially, each GOT entry points to the second instruction in
the corresponding PLT entry.


Figure 7.19(a) shows how the GOT and PLT work together to lazily resolve
the run-time address of function addvec the first time it is called:


Step 1. Instead of directly calling addvec, the program calls into PLT[2], which
is the PLT entry for addvec.


Step 2. The first PLT instruction does an indirect jump through GOT[4]. Since
each GOT entry initially points to the second instruction in its
corresponding PLT entry, the indirect jump simply transfers control back to the next
instruction in PLT[2].


Step 3. After pushing an ID for addvec (0x1) onto the stack, PLT[2] jumps to
PLT[0].


Step 4. PLT[0] pushes an argument for the dynamic linker indirectly through
GOT[1] and then jumps into the dynamic linker indirectly through GOT[2].
The dynamic linker uses the two stack entries to determine the
run-time location of addvec, overwrites GOT[4] with this address, and passes
control to addvec.


Figure 7.19(b) shows the control flow for any subsequent invocations of
addvec:


Step 1. Control passes to PLT[2] as before.


Step 2. However, this time the indirect jump through GOT[4] transfers control
directly to addvec.
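
The two control flows above can be imitated in plain C with a function pointer that starts out pointing at a resolver stub. This is a simplified model under stated assumptions, not the actual PLT machinery: got_add plays the role of GOT[4], add_stub stands in for the PLT[2], PLT[0], and dynamic linker path, and every identifier here is hypothetical.

```c
#include <stddef.h>

/* Illustrative model of lazy binding; all names here are hypothetical. */
typedef int (*func_t)(int, int);

static int resolver_calls = 0;          /* How often the slow path ran */

/* The "real" function that would live in the shared library. */
static int add(int a, int b) { return a + b; }

static int add_stub(int a, int b);      /* Forward declaration */

/* Stand-in for GOT[4]: initially points at the stub (the slow path). */
static func_t got_add = add_stub;

/* Stand-in for the PLT[2] -> PLT[0] -> dynamic linker path: resolve the
   target, overwrite the GOT entry, then pass control to the target. */
static int add_stub(int a, int b)
{
    resolver_calls++;
    got_add = add;                      /* Overwrite the GOT entry */
    return add(a, b);                   /* Pass control to the target */
}

/* Stand-in for the call site: always an indirect call through the GOT. */
int call_add(int a, int b)
{
    return got_add(a, b);
}

/* Accessor so callers can observe how often resolution happened. */
int get_resolver_calls(void)
{
    return resolver_calls;
}
```

The first call_add takes the slow path and patches got_add; every later call goes straight to add, so get_resolver_calls() stays at 1. The call site itself never changes, which is the property that lets the code segment remain position independent.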


7.13 Library Interpositioning 


Linux linkers support a powerful technique, called library interpositioning, that 
allows you to intercept calls to shared library functions and execute your own code 
instead. Using interpositioning, you could trace the number of times a particular 
library function is called, validate and trace its input and output values, or even
replace it with a completely different implementation. 

Here’s the basic idea: Given some target function to be interposed on, you 
create a wrapper function whose prototype is identical to the target function. Using 
some particular interpositioning mechanism, you then trick the system into calling 
the wrapper function instead of the target function. The wrapper function typically 
executes its own logic, then calls the target function and passes its return value 
back to the caller. 

Interpositioning can occur at compile time, link time, or run time as the 
program is being loaded and executed. To explore these different mechanisms, 
we will use the example program in Figure 7.20(a) as a running example. It calls 
the malloc and free functions from the C standard library (libc.so). The call to
malloc allocates a block of 32 bytes from the heap and returns a pointer to the
block. The call to free gives the block back to the heap, for use by subsequent 
calls to malloc. Our goal is to use interpositioning to trace the calls to malloc and 
free as the program runs. 


7.13.1 Compile-Time Interpositioning 


Figure 7.20 shows how to use the C preprocessor to interpose at compile time. 
Each wrapper function in mymalloc.c (Figure 7.20(c)) calls the target function, 
prints a trace, and returns. The local malloc.h header file (Figure 7.20(b)) instructs
the preprocessor to replace each call to a target function with a call to its wrapper.
Here is how to compile and link the program: 


linux> gcc -DCOMPILETIME -c mymalloc.c
linux> gcc -I. -o intc int.c mymalloc.o


The interpositioning happens because of the -I. argument, which tells the C
preprocessor to look for malloc.h in the current directory before looking in the 
usual system directories. Notice that the wrappers in mymalloc.c are compiled 
with the standard malloc.h header file. 

Running the program gives the following trace:


linux> ./intc
malloc(32)=0x9ee010
free(0x9ee010)
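
The preprocessor trick can also be seen in a single file, which makes the ordering requirement explicit. The sketch below is a condensed variation, not the book's multi-file build: the counter malloc_calls and the function allocate_and_release are illustrative additions, and only malloc is wrapped. The wrapper must be defined before the macro so that its own call still reaches the library function, which is the same reason mymalloc.c is compiled with the standard header.

```c
#include <stdio.h>
#include <stdlib.h>

static int malloc_calls = 0;   /* Illustrative counter, not in the book's code */

/* Wrapper defined BEFORE the macro below, so the name malloc here
   still refers to the real library function. */
void *mymalloc(size_t size)
{
    void *ptr = malloc(size);
    malloc_calls++;
    printf("malloc(%d)=%p\n", (int)size, ptr);
    return ptr;
}

/* From this point on, the preprocessor rewrites every call to malloc
   into a call to mymalloc, just as the local malloc.h does. */
#define malloc(size) mymalloc(size)

/* A caller that is unaware of the interpositioning. Returns the number
   of traced malloc calls so far, or -1 if the allocation failed. */
int allocate_and_release(void)
{
    int *p = malloc(32);       /* Expands to mymalloc(32) */
    if (p == NULL)
        return -1;
    free(p);                   /* free is left unwrapped in this sketch */
    return malloc_calls;
}
```

Each call to allocate_and_release() prints one trace line and returns the running count of intercepted allocations. Splitting the wrapper and the macro into separate files, as Figure 7.20 does, achieves the same ordering without relying on where the #define appears.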


7.13.2 Link-Time Interpositioning 


The Linux static linker supports link-time interpositioning with the --wrap f flag.
This flag tells the linker to resolve references to symbol f as __wrap_f (two
underscores for the prefix), and to resolve references to symbol __real_f
(two underscores for the prefix) as f. Figure 7.21 shows the wrappers for our
example program.

Here is how to compile the source files into relocatable object files:


linux> gcc -DLINKTIME -c mymalloc.c
linux> gcc -c int.c

(a) Example program int.c

code/link/interpose/int.c

#include <stdio.h>
#include <malloc.h>

int main()
{
    int *p = malloc(32);
    free(p);
    return(0);
}

code/link/interpose/int.c


(b) Local malloc.h file

code/link/interpose/malloc.h

#define malloc(size) mymalloc(size)
#define free(ptr) myfree(ptr)

void *mymalloc(size_t size);
void myfree(void *ptr);

code/link/interpose/malloc.h


(c) Wrapper functions in mymalloc.c

code/link/interpose/mymalloc.c

#ifdef COMPILETIME
#include <stdio.h>
#include <malloc.h>

/* malloc wrapper function */
void *mymalloc(size_t size)
{
    void *ptr = malloc(size);
    printf("malloc(%d)=%p\n",
           (int)size, ptr);
    return ptr;
}

/* free wrapper function */
void myfree(void *ptr)
{
    free(ptr);
    printf("free(%p)\n", ptr);
}
#endif

code/link/interpose/mymalloc.c


Figure 7.20 Compile-time interpositioning with the C preprocessor.

code/link/interpose/mymalloc.c

#ifdef LINKTIME
#include <stdio.h>

void *__real_malloc(size_t size);
void __real_free(void *ptr);

/* malloc wrapper function */
void *__wrap_malloc(size_t size)
{
    void *ptr = __real_malloc(size); /* Call libc malloc */
    printf("malloc(%d) = %p\n", (int)size, ptr);
    return ptr;
}

/* free wrapper function */
void __wrap_free(void *ptr)
{
    __real_free(ptr); /* Call libc free */
    printf("free(%p)\n", ptr);
}
#endif

code/link/interpose/mymalloc.c


Figure 7.21 Link-time interpositioning with the --wrap flag.


And here is how to link the object files into an executable:

linux> gcc -Wl,--wrap,malloc -Wl,--wrap,free -o intl int.o mymalloc.o


The -Wl,option flag passes option to the linker. Each comma in option is
replaced with a space. So -Wl,--wrap,malloc passes --wrap malloc to the linker,
and similarly for -Wl,--wrap,free.

Running the program gives the following trace: 


linux> ./intl
malloc(32) = 0x18cf010
free(0x18cf010)


7.13.3 Run-Time Interpositioning 


Compile-time interpositioning requires access to a program's source files.
Link-time interpositioning requires access to its relocatable object files. However, there
is a mechanism for interpositioning at run time that requires access only to the
executable object file. This fascinating mechanism is based on the dynamic linker's
LD_PRELOAD environment variable.

If the LD_PRELOAD environment variable is set to a list of shared library
pathnames (separated by spaces or colons), then when you load and execute a
program, the dynamic linker (ld-linux.so) will search the LD_PRELOAD libraries
first, before any other shared libraries, when it resolves undefined references. With
this mechanism, you can interpose on any function in any shared library, including
libc.so, when you load and execute any executable.

Figure 7.22 shows the wrappers for malloc and free. In each wrapper, the
call to dlsym returns the pointer to the target libc function. The wrapper then
calls the target function, prints a trace, and returns.

Here is how to build the shared library that contains the wrapper functions: 


linux> gcc -DRUNTIME -shared -fpic -o mymalloc.so mymalloc.c -ldl


Here is how to compile the main program: 

linux> gcc -o intr int.c


Here is how to run the program from the bash shell:³


linux> LD_PRELOAD="./mymalloc.so" ./intr
malloc(32) = 0x1bf7010
free(0x1bf7010)


And here is how to run it from the csh or tcsh shells: 


linux> (setenv LD_PRELOAD "./mymalloc.so"; ./intr; unsetenv LD_PRELOAD)
malloc(32) = 0x2157010 
free(0x2157010) 


Notice that you can use LD_PRELOAD to interpose on the library calls of any
executable program! 


linux> LD_PRELOAD="./mymalloc.so" /usr/bin/uptime
malloc(568) = 0x21bb010
free(0x21bb010)
malloc(16) = 0x21bb010
malloc(568) = 0x21bb030
malloc(2255) = 0x21bb270
free(0x21bb030)
malloc(20) = 0x21bb030
malloc(20) = 0x21bb050
malloc(20) = 0x21bb070
malloc(20) = 0x21bb090
malloc(20) = 0x21bb0b0
malloc(384) = 0x21bb0d0
20:47:36 up 85 days, 6:04, 1 user, load average: 0.10, 0.04, 0.05


3. If you don't know what shell you are running, type printenv SHELL at the command line. 

code/link/interpose/mymalloc.c

#ifdef RUNTIME
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

/* malloc wrapper function */
void *malloc(size_t size)
{
    void *(*mallocp)(size_t size);
    char *error;

    mallocp = dlsym(RTLD_NEXT, "malloc"); /* Get address of libc malloc */
    if ((error = dlerror()) != NULL) {
        fputs(error, stderr);
        exit(1);
    }
    char *ptr = mallocp(size); /* Call libc malloc */
    printf("malloc(%d) = %p\n", (int)size, ptr);
    return ptr;
}

/* free wrapper function */
void free(void *ptr)
{
    void (*freep)(void *) = NULL;
    char *error;

    if (!ptr)
        return;

    freep = dlsym(RTLD_NEXT, "free"); /* Get address of libc free */
    if ((error = dlerror()) != NULL) {
        fputs(error, stderr);
        exit(1);
    }
    freep(ptr); /* Call libc free */
    printf("free(%p)\n", ptr);
}
#endif

code/link/interpose/mymalloc.c


Figure 7.22 Run-time interpositioning with LD_PRELOAD.

7.14 Tools for Manipulating Object Files


There are a number of tools available on Linux systems to help you understand
and manipulate object files. In particular, the GNU binutils package is especially
helpful and runs on every Linux platform.


AR. Creates static libraries, and inserts, deletes, lists, and extracts members.

STRINGS. Lists all of the printable strings contained in an object file.

STRIP. Deletes symbol table information from an object file.

NM. Lists the symbols defined in the symbol table of an object file.

SIZE. Lists the names and sizes of the sections in an object file.

READELF. Displays the complete structure of an object file, including all of the
information encoded in the ELF header. Subsumes the functionality of
SIZE and NM.

OBJDUMP. The mother of all binary tools. Can display all of the information in an
object file. Its most useful function is disassembling the binary instructions
in the .text section.


Linux systems also provide the LDD program for manipulating shared libraries:


LDD: Lists the shared libraries that an executable needs at run time.


7.15 Summary 


Linking can be performed at compile time by static linkers and at load time and run 
time by dynamic linkers. Linkers manipulate binary files called object files, which 
come in three different forms: relocatable, executable, and shared. Relocatable 
object files are combined by static linkers into an executable object file that can 
be loaded into memory and executed. Shared object files (shared libraries) are
linked and loaded by dynamic linkers at run time, either implicitly when the calling 
program is loaded and begins executing, or on demand, when the program calls 
functions from the dlopen library. 

The two main tasks of linkers are symbol resolution, where each global symbol
in an object file is bound to a unique definition, and relocation, where the ultimate 
memory address for each symbol is determined and where references to those 
objects are modified. 

Static linkers are invoked by compiler drivers such as gcc. They combine
multiple relocatable object files into a single executable object file. Multiple object 
files can define the same symbol, and the rules that linkers use for silently resolving 
these multiple definitions can introduce subtle bugs in user programs. 

Multiple object files can be concatenated in a single static library. Linkers 
use libraries to resolve symbol references in other object modules. The left-to- 
right sequential scan that many linkers use to resolve symbol references is another 
source of confusing link-time errors. 

Loaders map the contents of executable files into memory and run the
program. Linkers can also produce partially linked executable object files with
unresolved references to the routines and data defined in a shared library. At load
time, the loader maps the partially linked executable into memory and then calls 
a dynamic linker, which completes the linking task by loading the shared library 
and relocating the references in the program. 

Shared libraries that are compiled as position-independent code can be loaded 
anywhere and shared at run time by multiple processes. Applications can also use 
the dynamic linker at run time in order to load, link, and access the functions and 
data in shared libraries.


Bibliographic Notes 


Linking is poorly documented in the computer systems literature. Since it lies at 
the intersection of compilers, computer architecture, and operating systems, link- 
ing requires an understanding of code generation, machine-language program- 
ming, program instantiation, and virtual memory. It does not fit neatly into any of 
the usual computer systems specialties and thus is not well covered by the classic 
texts in these areas. However, Levine’s monograph provides a good general
reference on the subject [69]. The original IA32 specifications for ELF and DWARF
(a specification for the contents of the .debug and .line sections) are described
in [54]. The x86-64 extensions to the ELF file format are described in [36]. The 
x86-64 application binary interface (ABI) describes the conventions for compil- 
ing, linking, and running x86-64 programs, including the rules for relocation and 
position-independent code [77]. 


Homework Problems

7.6 ◆
This problem concerns the m.o module from Figure 7.5 and the following version
of the swap.c function that counts the number of times it has been called:


extern int buf[];

int *bufp0 = &buf[0];
static int *bufp1;

static void incr()
{
    static int count=0;

    count++;
}

void swap()
{
    int temp;

    incr();
    bufp1 = &buf[1];
    temp = *bufp0;
    *bufp0 = *bufp1;
    *bufp1 = temp;
}


For each symbol that is defined and referenced in swap.o, indicate if it will
have a symbol table entry in the .symtab section in module swap.o. If so, indicate
the module that defines the symbol (swap.o or m.o), the symbol type (local, global,
or extern), and the section (.text, .data, or .bss) it occupies in that module.


Symbol    swap.o .symtab entry?    Symbol type    Module where defined    Section
------    ---------------------    -----------    --------------------    -------
buf
bufp0
bufp1
swap
temp
incr
count

7.7 ◆
Without changing any variable names, modify bar5.c on page 683 so that foo5.c
prints the correct values of x and y (i.e., the hex representations of integers 15213
and 15212).


7.8 ◆
In this problem, let REF(x.i) → DEF(x.k) denote that the linker will associate an
arbitrary reference to symbol x in module i to the definition of x in module k. For
each example below, use this notation to indicate how the linker would resolve
references to the multiply-defined symbol in each module. If there is a link-time
error (rule 1), write “ERROR”. If the linker arbitrarily chooses one of the definitions
(rule 3), write “UNKNOWN”.


A. /* Module 1 */          /* Module 2 */
   int main()              static int main=1;
   {                       int p2()
   }                       {
                           }

   (a) REF(main.1) → DEF(_____._____)
   (b) REF(main.2) → DEF(_____._____)


B. /* Module 1 */          /* Module 2 */
   int x;                  double x;
   void main()             int p2()
   {                       {
   }                       }

   (a) REF(x.1) → DEF(_____._____)
   (b) REF(x.2) → DEF(_____._____)


C. /* Module 1 */          /* Module 2 */
   int x=1;                double x=1.0;
   void main()             int p2()
   {                       {
   }                       }

   (a) REF(x.1) → DEF(_____._____)
   (b) REF(x.2) → DEF(_____._____)

7.9 ◆
Consider the following program, which consists of two object modules:


/* foo6.c */
void p2(void);

int main()
{
    p2();
    return 0;
}


/* bar6.c */
#include <stdio.h>

char main;

void p2()
{
    printf("0x%x\n", main);
}


When this program is compiled and executed on an x86-64 Linux system, it
prints the string 0x48\n and terminates normally, even though function p2 never
initializes variable main. Can you explain this?

7.10 ◆
Let a and b denote object modules or static libraries in the current directory, and
let a → b denote that a depends on b, in the sense that b defines a symbol that is
referenced by a. For each of the following scenarios, show the minimal command
line (i.e., one with the least number of object file and library arguments) that will
allow the static linker to resolve all symbol references:


A. p.o → libx.a → p.o
B. p.o → libx.a → liby.a and liby.a → libx.a
C. p.o → libx.a → liby.a → libz.a and liby.a → libx.a → libz.a


7.11 ◆◆
The program header in Figure 7.14 indicates that the data segment occupies 0x230
bytes in memory. However, only the first 0x228 bytes of these come from the
sections of the executable file. What causes this discrepancy?


7.12 ◆◆
Consider the call to function swap in object file m.o (Problem 7.6).


9: e8 00 00 00 00    callq e <main+0xe>    swap()


with the following relocation entry:


r.offset = 0xa
r.symbol = swap
r.type = R_X86_64_PC32
r.addend = -4


A. Suppose that the linker relocates .text in m.o to address 0x4004e0 and swap
to address 0x4004f8. Then what is the value of the relocated reference to
swap in the callq instruction?

B. Suppose that the linker relocates .text in m.o to address 0x4004d0 and swap
to address 0x400500. Then what is the value of the relocated reference to
swap in the callq instruction?


7.13 ◆◆
Performing the following tasks will help you become more familiar with the
various tools for manipulating object files.


A. How many object files are contained in the versions of libc.a and libm.a
on your system?
B. Does gcc -Og produce different executable code than gcc -Og -g?
C. What shared libraries does the gcc driver on your system use?


Solutions to Practice Problems 




Solution to Problem 7.1 (page 678) 
The purpose of this problem is to help you understand the relationship between 
linker symbols and C variables and functions. Notice that the C local variable temp 
does not have a symbol table entry. 

Symbol    .symtab entry?    Symbol type    Module where defined    Section
------    --------------    -----------    --------------------    -------
buf       Yes               extern         m.o                     .data
bufp0     Yes               global         swap.o                  .data
bufp1     Yes               global         swap.o                  COMMON
swap      Yes               global         swap.o                  .text
temp      No                -              -                       -


Solution to Problem 7.2 (page 684) 

This is a simple drill that checks your understanding of the rules that a Unix linker 
uses when it resolves global symbols that are defined in more than one module. 
Understanding these rules can help you avoid some nasty programming bugs. 


A. The linker chooses the strong symbol defined in module 1 over the weak
symbol defined in module 2 (rule 2):
(a) REF(main.1) → DEF(main.1)
(b) REF(main.2) → DEF(main.1)


B. This is an ERROR, because each module defines a strong symbol main (rule 1).


C. The linker chooses the strong symbol defined in module 2 over the weak
symbol defined in module 1 (rule 2):
(a) REF(x.1) → DEF(x.2)
(b) REF(x.2) → DEF(x.2)


Solution to Problem 7.3 (page 689) 

Placing static libraries in the wrong order on the command line is a common source 
of linker errors that confuses many programmers. However, once you understand 
how linkers use static libraries to resolve references, it's pretty straightforward. 
This little drill checks your understanding of this idea: 


A. linux> gcc p.o libx.a
B. linux> gcc p.o libx.a liby.a
C. linux> gcc p.o libx.a liby.a libx.a


Solution to Problem 7.4 (page 694) 

This problem concerns the disassembly listing in Figure 7.12(a). Our purpose 
here is to give you some practice reading disassembly listings and to check your 
understanding of PC-relative addressing. 


A. The hex address of the relocated reference in line 5 is 0x4004df. 


B. The hex value of the relocated reference in line 5 is 0x5. Remember that
the disassembly listing shows the value of the reference in little-endian byte
order.


Solution to Problem 7.5 (page 695) 

This problem tests your understanding of how the linker relocates PC-relative 
references. You were given that 

    ADDR(s) = ADDR(.text) = 0x4004d0 

and 

    ADDR(r.symbol) = ADDR(swap) = 0x4004e8 


Using the algorithm in Figure 7.10, the linker first computes the run-time 
address of the reference: 

    refaddr = ADDR(s) + r.offset 
            = 0x4004d0 + 0xa 
            = 0x4004da 


It then updates the reference: 


    *refptr = (unsigned) (ADDR(r.symbol) + r.addend - refaddr) 
            = (unsigned) (0x4004e8 + (-4) - 0x4004da) 
            = (unsigned) (0xa) 

Thus, in the resulting executable object file, the PC-relative reference to swap has 
a value of 0xa: 

    4004d9: e8 0a 00 00 00          callq 4004e8 <swap> 
From the time you first apply power to a processor until the time you shut it off, 
the program counter assumes a sequence of values 

    a_0, a_1, ..., a_{n-1} 

where each a_k is the address of some corresponding instruction I_k. Each transition 
from a_k to a_{k+1} is called a control transfer. A sequence of such control transfers is 
called the flow of control, or control flow, of the processor. 

The simplest kind of control flow is a “smooth” sequence where each I_k and 
I_{k+1} are adjacent in memory. Typically, abrupt changes to this smooth flow, where 
I_{k+1} is not adjacent to I_k, are caused by familiar program instructions such as jumps, 
calls, and returns. Such instructions are necessary mechanisms that allow programs 
to react to changes in internal program state represented by program variables. 

But systems must also be able to react to changes in system state that are 
not captured by internal program variables and are not necessarily related to 
the execution of the program. For example, a hardware timer goes off at regular 
intervals and must be dealt with. Packets arrive at the network adapter and must 
be stored in memory. Programs request data from a disk and then sleep until they 
are notified that the data are ready. Parent processes that create child processes 
must be notified when their children terminate. 

Modern systems react to these situations by making abrupt changes in the 
control flow. In general, we refer to these abrupt changes as exceptional control 
flow (ECF). ECF occurs at all levels of a computer system. For example, at the 
hardware level, events detected by the hardware trigger abrupt control transfers 
to exception handlers. At the operating systems level, the kernel transfers control 
from one user process to another via context switches. At the application level, 
a process can send a signal to another process that abruptly transfers control to 
a signal handler in the recipient. An individual program can react to errors by 
sidestepping the usual stack discipline and making nonlocal jumps to arbitrary 
locations in other functions. 

As programmers, there are a number of reasons why it is important for you 
to understand ECF: 


* Understanding ECF will help you understand important systems concepts. ECF 
is the basic mechanism that operating systems use to implement I/O, processes, 
and virtual memory. Before you can really understand these important ideas, 
you need to understand ECF. 


* Understanding ECF will help you understand how applications interact with the 
operating system. Applications request services from the operating system by 
using a form of ECF known as a trap or system call. For example, writing data 
to a disk, reading data from a network, creating a new process, and terminating 
the current process are all accomplished by application programs invoking 
system calls. Understanding the basic system call mechanism will help you 
understand how these services are provided to applications. 


* Understanding ECF will help you write interesting new application programs. 
The operating system provides application programs with powerful ECF 
mechanisms for creating new processes, waiting for processes to terminate, 
notifying other processes of exceptional events in the system, and detecting 
and responding to these events. If you understand these ECF mechanisms, 
then you can use them to write interesting programs such as Unix shells and 
Web servers. 


* Understanding ECF will help you understand concurrency. ECF is a basic 
mechanism for implementing concurrency in computer systems. The following 
are all examples of concurrency in action: an exception handler that interrupts 
the execution of an application program; processes and threads whose execution 
overlaps in time; and a signal handler that interrupts the execution of 
an application program. Understanding ECF is a first step to understanding 
concurrency. We will return to study it in more detail in Chapter 12. 


* Understanding ECF will help you understand how software exceptions work. 
Languages such as C++ and Java provide software exception mechanisms via 
try, catch, and throw statements. Software exceptions allow the program 
to make nonlocal jumps (i.e., jumps that violate the usual call/return stack 
discipline) in response to error conditions. Nonlocal jumps are a form of 
application-level ECF and are provided in C via the setjmp and longjmp 
functions. Understanding these low-level functions will help you understand 
how higher-level software exceptions can be implemented. 


Up to this point in your study of systems, you have learned how applications 
interact with the hardware. This chapter is pivotal in the sense that you will begin 
to learn how your applications interact with the operating system. Interestingly, 
these interactions all revolve around ECF. We describe the various forms of ECF 
that exist at all levels of a computer system. We start with exceptions, which lie at 
the intersection of the hardware and the operating system. We also discuss system 
calls, which are exceptions that provide applications with entry points into the 
operating system. We then move up a level of abstraction and describe processes 
and signals, which lie at the intersection of applications and the operating system. 
Finally, we discuss nonlocal jumps, which are an application-level form of ECF. 


8.1 Exceptions 


Exceptions are a form of exceptional control flow that are implemented partly 
by the hardware and partly by the operating system. Because they are partly 
implemented in hardware, the details vary from system to system. However, the 
basic ideas are the same for every system. Our aim in this section is to give you a 
general understanding of exceptions and exception handling and to help demystify 
what is often a confusing aspect of modern computer systems. 

An exception is an abrupt change in the control flow in response to some 
change in the processor's state. Figure 8.1 shows the basic idea. 

In the figure, the processor is executing some current instruction I_curr when a 
significant change in the processor's state occurs. The state is encoded in various 
bits and signals inside the processor. The change in state is known as an event. 

Aside  Hardware versus software exceptions 

C++ and Java programmers will have noticed that the term “exception” is also used to describe the 
application-level ECF mechanism provided by C++ and Java in the form of catch, throw, and try 
statements. If we wanted to be perfectly clear, we might distinguish between “hardware” and “software” 
exceptions, but this is usually unnecessary because the meaning is clear from the context. 

Figure 8.1 Anatomy of an exception. A change in the processor's state (an event) triggers 
an abrupt control transfer (an exception) from the application program to an exception 
handler. After it finishes processing, the handler either returns control to the interrupted 
program or aborts. 


The event might be directly related to the execution of the current instruction. 
For example, a virtual memory page fault occurs, an arithmetic overflow occurs, 
or an instruction attempts a divide by zero. On the other hand, the event might be 
unrelated to the execution of the current instruction. For example, a system timer 
goes off or an I/O request completes. 

In any case, when the processor detects that the event has occurred, it makes 
an indirect procedure call (the exception), through a jump table called an exception 
table, to an operating system subroutine (the exception handler) that is specifically 
designed to process this particular kind of event. When the exception handler 
finishes processing, one of three things happens, depending on the type of event 
that caused the exception: 


1. The handler returns control to the current instruction I_curr, the instruction 
   that was executing when the event occurred. 

2. The handler returns control to I_next, the instruction that would have executed 
   next had the exception not occurred. 

3. The handler aborts the interrupted program. 

Section 8.1.2 says more about these possibilities. 


8.1.1 Exception Handling 


Exceptions can be difficult to understand because handling them involves close 
cooperation between hardware and software. It is easy to get confused about 
which component performs which task. Let’s look at the division of labor between 
hardware and software in more detail. 

Figure 8.2 Exception table. The exception table is a jump table where entry k contains 
the address of the handler code for exception k. 

Figure 8.3 Generating the address of an exception handler. The exception number is 
an index into the exception table. 

Each type of possible exception in a system is assigned a unique nonnegative 
integer exception number. Some of these numbers are assigned by the designers 
of the processor. Other numbers are assigned by the designers of the operating 
system kernel (the memory-resident part of the operating system). Examples of 
the former include divide by zero, page faults, memory access violations, breakpoints, 
and arithmetic overflows. Examples of the latter include system calls and 
signals from external I/O devices. 

At system boot time (when the computer is reset or powered on), the operating 
system allocates and initializes a jump table called an exception table, so that 
entry k contains the address of the handler for exception k. Figure 8.2 shows the 
format of an exception table. 

At run time (when the system is executing some program), the processor 
detects that an event has occurred and determines the corresponding exception 
number k. The processor then triggers the exception by making an indirect pro- 
cedure call, through entry k of the exception table, to the corresponding handler. 
Figure 8.3 shows how the processor uses the exception table to form the address of 
the appropriate exception handler. The exception number is an index into the ex- 
ception table, whose starting address is contained in a special CPU register called 
the exception table base register. 

An exception is akin to a procedure call, but with some important differences: 

* As with a procedure call, the processor pushes a return address on the stack 
before branching to the handler. However, depending on the class of exception, 
the return address is either the current instruction (the instruction that 
was executing when the event occurred) or the next instruction (the instruction 
that would have executed after the current instruction had the event not 
occurred). 

* The processor also pushes some additional processor state onto the stack that 
will be necessary to restart the interrupted program when the handler returns. 
For example, an x86-64 system pushes the EFLAGS register containing the 
current condition codes, among other things, onto the stack. 

* When control is being transferred from a user program to the kernel, all of 
these items are pushed onto the kernel's stack rather than onto the user's 
stack. 

* Exception handlers run in kernel mode (Section 8.2.4), which means they have 
complete access to all system resources. 


Once the hardware triggers the exception, the rest of the work is done in 
software by the exception handler. After the handler has processed the event, it 
optionally returns to the interrupted program by executing a special “return from 
interrupt" instruction, which pops the appropriate state back into the processor's 
control and data registers, restores the state to user mode (Section 8.2.4) if the 
exception interrupted a user program, and then returns control to the interrupted 
program. 






8.1.2 Classes of Exceptions 


Exceptions can be divided into four classes: interrupts, traps, faults, and aborts. 
The table in Figure 8.4 summarizes the attributes of these classes. 





















Interrupts 


Interrupts occur asynchronously as a result of signals from I/O devices that are 
external to the processor. Hardware interrupts are asynchronous in the sense 
that they are not caused by the execution of any particular instruction. Exception 
handlers for hardware interrupts are often called interrupt handlers. 

Figure 8.5 summarizes the processing for an interrupt. I/O devices such as 
network adapters, disk controllers, and timer chips trigger interrupts by signaling 
a pin on the processor chip and placing onto the system bus the exception number 
that identifies the device that caused the interrupt. 


Class       Cause                           Async/sync   Return behavior 

Interrupt   Signal from I/O device          Async        Always returns to next instruction 
Trap        Intentional exception           Sync         Always returns to next instruction 
Fault       Potentially recoverable error   Sync         Might return to current instruction 
Abort       Nonrecoverable error            Sync         Never returns 

Figure 8.4 Classes of exceptions. Asynchronous exceptions occur as a result of events in I/O devices that 
are external to the processor. Synchronous exceptions occur as a direct result of executing an instruction. 

Figure 8.5 Interrupt handling. The interrupt handler returns control to the next 
instruction in the application program's control flow. 

Figure 8.6 Trap handling. The trap handler returns control to the next instruction in 
the application program's control flow. 


After the current instruction finishes executing, the processor notices that the 
interrupt pin has gone high, reads the exception number from the system bus, and 
then calls the appropriate interrupt handler. When the handler returns, it returns 
control to the next instruction (i.e., the instruction that would have followed the 
current instruction in the control flow had the interrupt not occurred). The effect is 
that the program continues executing as though the interrupt had never happened. 

The remaining classes of exceptions (traps, faults, and aborts) occur synchronously 
as a result of executing the current instruction. We refer to this instruction 
as the faulting instruction. 


Traps and System Calls 


Traps are intentional exceptions that occur as a result of executing an instruction. 
Like interrupt handlers, trap handlers return control to the next instruction. The 
most important use of traps is to provide a procedure-like interface between user 
programs and the kernel, known as a system call. 

User programs often need to request services from the kernel such as reading 
a file (read), creating a new process (fork), loading a new program (execve), and 
terminating the current process (exit). To allow controlled access to such kernel 
services, processors provide a special syscall n instruction that user programs can 
execute when they want to request service n. Executing the syscall instruction 
causes a trap to an exception handler that decodes the argument and calls the 
appropriate kernel routine. Figure 8.6 summarizes the processing for a system call. 

From a programmer's perspective, a system call is identical to a regular function 
call. However, their implementations are quite different. Regular functions 
run in user mode, which restricts the types of instructions they can execute, and 
they access the same stack as the calling function. A system call runs in kernel 
mode, which allows it to execute privileged instructions and access a stack defined 
in the kernel. Section 8.2.4 discusses user and kernel modes in more detail. 

Figure 8.7 Fault handling. Depending on whether the fault can be repaired or not, the 
fault handler either re-executes the faulting instruction or aborts. 

Figure 8.8 Abort handling. The abort handler passes control to a kernel abort routine 
that terminates the application program. 


Faults 


Faults result from error conditions that a handler might be able to correct. When 
a fault occurs, the processor transfers control to the fault handler. If the handler 
is able to correct the error condition, it returns control to the faulting instruction, 
thereby re-executing it. Otherwise, the handler returns to an abort routine in the 
kernel that terminates the application program that caused the fault. Figure 8.7 
summarizes the processing for a fault. 

A classic example of a fault is the page fault exception, which occurs when 
an instruction references a virtual address whose corresponding page is not res- 
ident in memory and must therefore be retrieved from disk. As we will see in 
Chapter 9, a page is a contiguous block (typically 4 KB) of virtual memory. The 
page fault handler loads the appropriate page from disk and then returns control 
to the instruction that caused the fault. When the instruction executes again, the 
appropriate page is now resident in memory and the instruction is able to run to 
completion without faulting. 


Aborts 


Aborts result from unrecoverable fatal errors, typically hardware errors such 
as parity errors that occur when DRAM or SRAM bits are corrupted. Abort 
handlers never return control to the application program. As shown in Figure 8.8, 
the handler returns control to an abort routine that terminates the application 
program. 

Exception number   Description                Exception class 

0                  Divide error               Fault 
13                 General protection fault   Fault 
14                 Page fault                 Fault 
18                 Machine check              Abort 
32-255             OS-defined exceptions      Interrupt or trap 

Figure 8.9 Examples of exceptions in x86-64 systems. 


8.1.3 Exceptions in Linux/x86-64 Systems 


To help make things more concrete, let's look at some of the exceptions defined 
for x86-64 systems. There are up to 256 different exception types [50]. Numbers 
in the range from 0 to 31 correspond to exceptions that are defined by the Intel 
architects and thus are identical for any x86-64 system. Numbers in the range from 
32 to 255 correspond to interrupts and traps that are defined by the operating 
system. Figure 8.9 shows a few examples. 


Linux/x86-64 Faults and Aborts 

Divide error. A divide error (exception 0) occurs when an application attempts 
    to divide by zero or when the result of a divide instruction is too big for 
    the destination operand. Unix does not attempt to recover from divide 
    errors, opting instead to abort the program. Linux shells typically report 
    divide errors as “Floating exceptions.” 

General protection fault. The infamous general protection fault (exception 13) 
    occurs for many reasons, usually because a program references an undefined 
    area of virtual memory or because the program attempts to write to a 
    read-only text segment. Linux does not attempt to recover from this fault. 
    Linux shells typically report general protection faults as “Segmentation 
    faults.” 

Page fault. A page fault (exception 14) is an example of an exception where 
    the faulting instruction is restarted. The handler maps the appropriate 
    page of virtual memory on disk into a page of physical memory and then 
    restarts the faulting instruction. We will see how page faults work in detail 
    in Chapter 9. 

Machine check. A machine check (exception 18) occurs as a result of a fatal 
    hardware error that is detected during the execution of the faulting 
    instruction. Machine check handlers never return control to the application 
    program. 


Linux/x86-64 System Calls 


Linux provides hundreds of system calls that application programs use when they 
want to request services from the kernel, such as reading a file, writing a file, and 
creating a new process. Figure 8.10 lists some popular Linux system calls. Each 
system call has a unique integer number that corresponds to an offset in a jump 
table in the kernel. (Notice that this jump table is not the same as the exception 
table.) 

Number  Name    Description                  Number  Name    Description 

0       read    Read file                    34      pause   Suspend process until signal arrives 
1       write   Write file                   37      alarm   Schedule delivery of alarm signal 
2       open    Open file                    39      getpid  Get process ID 
3       close   Close file                   57      fork    Create process 
4       stat    Get info about file          59      execve  Execute a program 
9       mmap    Map memory page to file      60      _exit   Terminate process 
12      brk     Reset the top of the heap    61      wait4   Wait for a process to terminate 
33      dup2    Copy file descriptor         62      kill    Send signal to a process 

Figure 8.10 Examples of popular system calls in Linux x86-64 systems. 

C programs can invoke any system call directly by using the syscall function. 
However, this is rarely necessary in practice. The C standard library provides a 
set of convenient wrapper functions for most system calls. The wrapper functions 
package up the arguments, trap to the kernel with the appropriate system call 
instruction, and then pass the return status of the system call back to the calling 
program. Throughout this text, we will refer to system calls and their associated 
wrapper functions interchangeably as system-level functions. 

System calls are provided on x86-64 systems via a trapping instruction called 
syscall. It is quite interesting to study how programs can use this instruction 
to invoke Linux system calls directly. All arguments to Linux system calls are 
passed through general-purpose registers rather than the stack. By convention, 
register %rax contains the syscall number, with up to six arguments in %rdi, %rsi, 
%rdx, %r10, %r8, and %r9. The first argument is in %rdi, the second in %rsi, and 
so on. On return from the system call, registers %rcx and %r11 are destroyed, and 
%rax contains the return value. A negative return value between -4,095 and -1 
indicates an error corresponding to negative errno. 

For example, consider the following version of the familiar hello program, 
written using the write system-level function (Section 10.4) instead of printf: 

    #include <unistd.h> 

    int main() 
    { 
        write(1, "hello, world\n", 13); 
        _exit(0); 
    } 

The first argument to write sends the output to stdout. The second argument 
is the sequence of bytes to write, and the third argument gives the number of bytes 
to write. 

code/ecf/hello-asm64.sa 

1   .section .data 
2   string: 
3     .ascii "hello, world\n" 
4   string_end: 
5     .equ len, string_end - string 
6   .section .text 
7   .globl main 
8   main: 
      # First, call write(1, "hello, world\n", 13) 
9     movq $1, %rax           # write is system call 1 
10    movq $1, %rdi           # Arg1: stdout has descriptor 1 
11    movq $string, %rsi      # Arg2: hello world string 
12    movq $len, %rdx         # Arg3: string length 
13    syscall                 # Make the system call 

      # Next, call _exit(0) 
14    movq $60, %rax          # _exit is system call 60 
15    movq $0, %rdi           # Arg1: exit status is 0 
16    syscall                 # Make the system call 

code/ecf/hello-asm64.sa 


Figure 8.11 Implementing the hello program directly with Linux system calls. 


Figure 8.11 shows an assembly-language version of hello that uses the 
syscall instruction to invoke the write and exit system calls directly. Lines 
9-13 invoke the write function. First, line 9 stores the number of the write system 
call in %rax, and lines 10-12 set up the argument list. Then, line 13 uses the 
syscall instruction to invoke the system call. Similarly, lines 14-16 invoke the 
_exit system call. 

8.2 Processes 


Exceptions are the basic building blocks that allow the operating system kernel 
to provide the notion of a process, one of the most profound and successful ideas 
in computer science. 

When we run a program on a modern system, we are presented with the 
illusion that our program is the only one currently running in the system. Our 
program appears to have exclusive use of both the processor and the memory. 
The processor appears to execute the instructions in our program, one after the 
other, without interruption. Finally, the code and data of our program appear to 
be the only objects in the system's memory. These illusions are provided to us by 
the notion of a process. 

The classic definition of a process is an instance of a program in execution. 
Each program in the system runs in the context of some process. The context 
consists of the state that the program needs to run correctly. This state includes the 
program's code and data stored in memory, its stack, the contents of its general- 
purpose registers, its program counter, environment variables, and the set of open 
file descriptors. 

Each time a user runs a program by typing the name of an executable object 
file to the shell, the shell creates a new process and then runs the executable object 
file in the context of this new process. Application programs can also create new 
processes and run either their own code or other applications in the context of the 
new process. 

A detailed discussion of how operating systems implement processes is be- 
yond our scope. Instead, we will focus on the key abstractions that a process 
provides to the application: 





* An independent logical control flow that provides the illusion that our pro- 
gram has exclusive use of the processor. 


* A private address space that provides the illusion that our program has exclu- 
sive use of the memory system. 


Let's look more closely at these abstractions. 





8.2.1 Logical Control Flow 









A process provides each program with the illusion that it has exclusive use of the 
processor, even though many other programs are typically running concurrently 
on the system. If we were to use a debugger to single-step the execution of 
our program, we would observe a series of program counter (PC) values that 
corresponded exclusively to instructions contained in our program's executable 
object file or in shared objects linked into our program dynamically at run time. 
This sequence of PC values is known as a logical control flow, or simply logical 
flow. 

Consider a system that runs three processes, as shown in Figure 8.12. The 
single physical control flow of the processor is partitioned into three logical flows, 
one for each process. Each vertical line represents a portion of the logical flow for 
a process. In the example, the execution of the three logical flows is interleaved. 
Process A runs for a while, followed by B, which runs to completion. Process C 
then runs for a while, followed by A, which runs to completion. Finally, C is able 
to run to completion. 

Figure 8.12 Logical control flows. Processes provide each program with the illusion 
that it has exclusive use of the processor. Each vertical bar represents a portion of 
the logical control flow for a process. 

The key point in Figure 8.12 is that processes take turns using the processor. 
Each process executes a portion of its flow and then is preempted (temporarily 
suspended) while other processes take their turns. To a program running in the 
context of one of these processes, it appears to have exclusive use of the processor. 
The only evidence to the contrary is that if we were to precisely measure the 
elapsed time of each instruction, we would notice that the CPU appears to periodically 
stall between the execution of some of the instructions in our program. 
However, each time the processor stalls, it subsequently resumes execution of our 
program without any change to the contents of the program’s memory locations 
or registers. 


8.2.2 Concurrent Flows 


Logical flows take many different forms in computer systems. Exception handlers, 
processes, signal handlers, threads, and Java processes are all examples of logical 
flows. 

A logical flow whose execution overlaps in time with another flow is called 
a concurrent flow, and the two flows are said to run concurrently. More precisely, 
flows X and Y are concurrent with respect to each other if and only if X begins 
after Y begins and before Y finishes, or Y begins after X begins and before X 
finishes. For example, in Figure 8.12, processes A and B run concurrently, as do 
A and C. On the other hand, B and C do not run concurrently, because the last 
instruction of B executes before the first instruction of C. 

The general phenomenon of multiple flows executing concurrently is known 
as concurrency. The notion of a process taking turns with other processes is also 
known as multitasking. Each time period that a process executes a portion of its 
flow is called a time slice. Thus, multitasking is also referred to as time slicing. For 
example, in Figure 8.12, the flow for process A consists of two time slices. 

Notice that the idea of concurrent flows is independent of the number of 
processor cores or computers that the flows are running on. If two flows overlap 
in time, then they are concurrent, even if they are running on the same processor. 
However, we will sometimes find it useful to identify a proper subset of concurrent 
flows known as parallel flows. If two flows are running concurrently on different 
processor cores or computers, then we say that they are parallel flows, that they 
are running in parallel, and have parallel execution. 


Process Start time End time 


A 0 2 
B 1 
C 3 5 5 


For each pair of processes, indicate whether they run concurrently (Y) or 
not (N): 


Process pair Concurrent? 


8.2.3 Private Address Space 


A process provides each program with the illusion that it has exclusive use of the 
system's address space. On a machine with n-bit addresses, the address space is the
set of 2^n possible addresses, 0, 1, ..., 2^n - 1. A process provides each program
with its own private address space. This space is private in the sense that a byte
of memory associated with a particular address in the space cannot in general be
read or written by any other process. 

Although the contents of the memory associated with each private address
space are different in general, each such space has the same general organization.
For example, Figure 8.13 shows the organization of the address space for an x86-64
Linux process.

The bottom portion of the address space is reserved for the user program, with 
the usual code, data, heap, and stack segments. The code segment always begins at 
address 0x400000. The top portion of the address space is reserved for the kernel 
(the memory-resident part of the operating system). This part of the address space 
contains the code, data, and stack that the kernel uses when it executes instructions 
on behalf of the process (e.g., when the application program executes a system 
call). 


8.2.4 User and Kernel Modes 


In order for the operating system kernel to provide an airtight process abstraction, 
the processor must provide a mechanism that restricts the instructions that an
application can execute, as well as the portions of the address space that it can
access.

Processors typically provide this capability with a mode bit in some control
register that characterizes the privileges that the process currently enjoys. When
the mode bit is set, the process is running in kernel mode (sometimes called
supervisor mode). A process running in kernel mode can execute any instruction 
in the instruction set and access any memory location in the system. 

When the mode bit is not set, the process is running in user mode. A process
in user mode is not allowed to execute privileged instructions that do things such
as halt the processor, change the mode bit, or initiate an I/O operation. Nor is it
allowed to directly reference code or data in the kernel area of the address space.
Any such attempt results in a fatal protection fault. User programs must instead
access kernel code and data indirectly via the system call interface.

A process running application code is initially in user mode. The only way for
the process to change from user mode to kernel mode is via an exception such as
an interrupt, a fault, or a trapping system call. When the exception occurs, and
control passes to the exception handler, the processor changes the mode from
user mode to kernel mode. The handler runs in kernel mode. When it returns to 
the application code, the processor changes the mode from kernel mode back to 
user mode. 

Linux provides a clever mechanism, called the /proc filesystem, that allows 
user mode processes to access the contents of kernel data structures. The /proc 
filesystem exports the contents of many kernel data structures as a hierarchy of text
files that can be read by user programs. For example, you can use the /proc filesystem
to find out general system attributes such as CPU type (/proc/cpuinfo), or
the memory segments used by a particular process (/proc/process-id/maps). The
2.6 version of the Linux kernel introduced a /sys filesystem, which exports additional
low-level information about system buses and devices.
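
Because the /proc entries are presented as text files, a user-mode program can read them with ordinary file I/O. The following is a minimal sketch, not part of the book's code: the helper name print_head is hypothetical, and the example paths in the note afterward assume a Linux system.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy text from the file at path to stdout until max_lines complete
 * lines have been copied or the file ends.
 * Returns the number of complete lines copied, or -1 if the file
 * cannot be opened.
 */
int print_head(const char *path, int max_lines)
{
    char chunk[512];
    int count = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (count < max_lines && fgets(chunk, sizeof(chunk), fp) != NULL) {
        fputs(chunk, stdout);
        if (strchr(chunk, '\n') != NULL)   /* chunk ends a line */
            count++;
    }
    fclose(fp);
    return count;
}
```

On Linux, a call such as print_head("/proc/cpuinfo", 5) shows the first few lines of CPU information, and print_head("/proc/self/maps", 5) shows the first few memory segments of the calling process itself.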


8.2.5 Context Switches 


The operating system kernel implements multitasking using a higher-level form 
of exceptional control flow known as a context switch. The context switch mecha- 
nism is built on top of the lower-level exception mechanism that we discussed in 
Section 8.1. 

The kernel maintains a context for each process. The context is the state 
that the kernel needs to restart a preempted process. It consists of the values 
of objects such as the general-purpose registers, the floating-point registers, the 
program counter, user’s stack, status registers, kernel’s stack, and various kernel 
data structures such as a page table that characterizes the address space, a process 
table that contains information about the current process, and a file table that 
contains information about the files that the process has opened. 

At certain points during the execution of a process, the kernel can decide 
to preempt the current process and restart a previously preempted process. This 
decision is known as scheduling and is handled by code in the kernel, called the 
scheduler. When the kernel selects a new process to run, we say that the kernel 
has scheduled that process. After the kernel has scheduled a new process to run, 
it preempts the current process and transfers control to the new process using a 
mechanism called a context switch that (1) saves the context of the current process, 
(2) restores the saved context of some previously preempted process, and (3) 
passes control to this newly restored process. 

A context switch can occur while the kernel is executing a system call on behalf 
of the user. If the system call blocks because it is waiting for some event to occur, 
then the kernel can put the current process to sleep and switch to another process.
For example, if a read system call requires a disk access, the kernel can opt to 
perform a context switch and run another process instead of waiting for the data 
to arrive from the disk. Another example is the sleep system call, which is an 
explicit request to put the calling process to sleep. In general, even if a system 
call does not block, the kernel can decide to perform a context switch rather than 
return control to the calling process. 

A context switch can also occur as a result of an interrupt. For example, all 
systems have some mechanism for generating periodic timer interrupts, typically 
every 1 ms or 10 ms. Each time a timer interrupt occurs, the kernel can decide that 
the current process has run long enough and switch to a new process. 
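
One way to observe these switches from user code is through the resource usage counters that Linux maintains for each process. The sketch below is an illustration under stated assumptions, not a definitive measurement tool: it relies on the ru_nvcsw field that getrusage fills in (voluntary switches, such as those caused by blocking), and the function name is hypothetical. A companion field, ru_nivcsw, counts involuntary switches such as preemption after a timer interrupt.

```c
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/*
 * Sleep for about 10 ms and report how many voluntary context switches
 * the kernel charged to this process in the meantime.
 * Returns -1 if getrusage fails.
 */
long voluntary_switches_during_nap(void)
{
    struct rusage before, after;
    struct timespec nap = { 0, 10 * 1000 * 1000 };   /* 10 ms */

    if (getrusage(RUSAGE_SELF, &before) < 0)
        return -1;
    nanosleep(&nap, NULL);     /* blocks, so the kernel can switch away */
    if (getrusage(RUSAGE_SELF, &after) < 0)
        return -1;
    return after.ru_nvcsw - before.ru_nvcsw;
}
```

The exact count depends on the system and its load, so the function reports whatever the kernel recorded rather than promising a particular value.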

Figure 8.14 shows an example of context switching between a pair of processes
A and B. In this example, initially process A is running in user mode until it traps to
the kernel by executing a read system call. The trap handler in the kernel requests
a DMA transfer from the disk controller and arranges for the disk to interrupt the
processor after the disk controller has finished transferring the data from disk to
memory.

Figure 8.14 Anatomy of a process context switch.

The disk will take a relatively long time to fetch the data (on the order of tens
of milliseconds), so instead of waiting and doing nothing in the interim, the kernel
performs a context switch from process A to B. Note that, before the switch, the
kernel is executing instructions in user mode on behalf of process A (i.e., there
is no separate kernel process). During the first part of the switch, the kernel is
executing instructions in kernel mode on behalf of process A. Then at some point
it begins executing instructions (still in kernel mode) on behalf of process B. And
after the switch, the kernel is executing instructions in user mode on behalf of
process B. 

Process B then runs for a while in user mode until the disk sends an interrupt 
to signal that data have been transferred from disk to memory. The kernel decides 
that process B has run long enough and performs a context switch from process B 
to A, returning control in process A to the instruction immediately following the
read system call. Process A continues to run until the next exception occurs, and 
so on. 




8.3 System Call Error Handling

When Unix system-level functions encounter an error, they typically return -1
and set the global integer variable errno to indicate what went wrong. Programmers
should always check for errors, but unfortunately, many skip error checking
because it bloats the code and makes it harder to read. For example, here is how
we might check for errors when we call the Linux fork function:

if ((pid = fork()) < 0) {
    fprintf(stderr, "fork error: %s\n", strerror(errno));
    exit(0);
}


The strerror function returns a text string that describes the error associated
with a particular value of errno. We can simplify this code somewhat by defining
the following error-reporting function:

void unix_error(char *msg) /* Unix-style error */
{
    fprintf(stderr, "%s: %s\n", msg, strerror(errno));
    exit(0);
}


Given this function, our call to fork reduces from four lines to two lines: 


if ((pid = fork()) < 0)
    unix_error("fork error");


We can simplify our code even further by using error-handling wrappers, 
as pioneered by Stevens in [110]. For a given base function foo, we define a 
wrapper function Foo with identical arguments but with the first letter of the name 
capitalized. The wrapper calls the base function, checks for errors, and terminates 
if there are any problems. For example, here is the error-handling wrapper for the 
fork function: 


pid_t Fork(void)
{
    pid_t pid;

    if ((pid = fork()) < 0)
        unix_error("Fork error");
    return pid;
}

Given this wrapper, our call to fork shrinks to a single compact line:

pid = Fork();


We will use error-handling wrappers throughout the remainder of this book. 
They allow us to keep our code examples concise without giving you the mistaken 
impression that it is permissible to ignore error checking. Note that when we
discuss system-level functions in the text, we will always refer to them by their 
lowercase base names, rather than by their uppercase wrapper names. 
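
The same recipe carries over to other system-level functions. As a hedged illustration, here is how the pattern might look for close; treat it as a sketch of the convention described above rather than a copy of the book's library code. The unix_error function is repeated so that the fragment stands alone.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Unix-style error reporter, as defined earlier in this section. */
void unix_error(char *msg)
{
    fprintf(stderr, "%s: %s\n", msg, strerror(errno));
    exit(0);
}

/* Wrapper for close: same argument as close, terminates on error. */
void Close(int fd)
{
    if (close(fd) < 0)
        unix_error("Close error");
}
```

A caller simply writes Close(fd); and, if the call returns at all, can rely on the descriptor having been closed successfully.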

See Appendix A for a discussion of Unix error handling and the error- 
handling wrappers used throughout this book. The wrappers are defined in a file 
called csapp.c, and their prototypes are defined in a header file called csapp.h.
These are available online from the CS:APP Web site. 


8.4 Process Control 


Unix provides a number of system calls for manipulating processes from C programs.
This section describes the important functions and gives examples of how
they are used.

8.4.1 Obtaining Process IDs


Each process has a unique positive (nonzero) process ID (PID). The getpid
function returns the PID of the calling process. The getppid function returns the 
PID of its parent (i.e., the process that created the calling process). 


#include <sys/types.h>
#include <unistd.h>

pid_t getpid(void);
pid_t getppid(void);
Returns: PID of either the caller or the parent

The getpid and getppid routines return an integer value of type pid_t, which on
Linux systems is defined in types.h as an int.
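
As a small illustration, the following sketch prints both IDs. The function name report_ids is our own, and the cast to long is a defensive assumption so that the printf format does not depend on how a particular system defines pid_t.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Print the caller's PID and its parent's PID; return the caller's PID. */
pid_t report_ids(void)
{
    pid_t self = getpid();
    pid_t parent = getppid();

    /* pid_t is a signed integer type; cast to long for a portable printf. */
    printf("pid=%ld, parent pid=%ld\n", (long)self, (long)parent);
    return self;
}
```

When the function is called from a program started by a shell, the parent PID it prints is typically the PID of that shell.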




8.4.2 Creating and Terminating Processes 


From a programmer's perspective, we can think of a process as being in one of
three states:

Running. The process is either executing on the CPU or waiting to be executed
and will eventually be scheduled by the kernel.

Stopped. The execution of the process is suspended and will not be scheduled.
A process stops as a result of receiving a SIGSTOP, SIGTSTP, SIGTTIN,
or SIGTTOU signal, and it remains stopped until it receives a SIGCONT
signal, at which point it becomes running again. (A signal is a form of
software interrupt that we will describe in detail in Section 8.5.)

Terminated. The process is stopped permanently. A process becomes terminated
for one of three reasons: (1) receiving a signal whose default action
is to terminate the process, (2) returning from the main routine, or (3)
calling the exit function.


#include <stdlib.h>

void exit(int status);

This function does not return

The exit function terminates the process with an exit status of status. (The other
way to set the exit status is to return an integer value from the main routine.)

A parent process creates a new running child process by calling the fork
function.


#include <sys/types.h> 
#include <unistd.h> 


pid_t fork(void); 
Returns: 0 to child, PID of child to parent, -1 on error


The newly created child process is almost, but not quite, identical to the parent. 
The child gets an identical (but separate) copy of the parent's user-level virtual 
address space, including the code and data segments, heap, shared libraries, and 
user stack. The child also gets identical copies of any of the parent's open file 
descriptors, which means the child can read and write any files that were open in 
the parent when it called fork. The most significant difference between the parent 
and the newly created child is that they have different PIDs. 

The fork function is interesting (and often confusing) because it is called once 
but it returns twice: once in the calling process (the parent), and once in the newly 
created child process. In the parent, fork returns the PID of the child. In the child, 
fork returns a value of 0. Since the PID of the child is always nonzero, the return 
value provides an unambiguous way to tell whether the program is executing in 
the parent or the child. 

Figure 8.15 shows a simple example of a parent process that uses fork to create 
a child process. When the fork call returns in line 6, x has a value of 1 in both the 
parent and child. The child increments and prints its copy of x in line 8. Similarly, 
the parent decrements and prints its copy of x in line 13. 

When we run the program on our Unix system, we get the following result: 


linux> ./fork
parent: x=0
child : x=2


There are some subtle aspects to this simple example. 


Call once, return twice. The fork function is called once by the parent, but it 
returns twice: once to the parent and once to the newly created child. 
This is fairly straightforward for programs that create a single child. But 
programs with multiple instances of fork can be confusing and need to 
be reasoned about carefully. 


Concurrent execution. The parent and the child are separate processes that 
run concurrently. The instructions in their logical control flows can be 
interleaved by the kernel in an arbitrary way. When we run the program 
on our system, the parent process completes its printf statement first,
followed by the child. However, on another system the reverse might be
true. In general, as programmers we can never make assumptions about
the interleaving of the instructions in different processes.

code/ecf/fork.c
1   int main()
2   {
3       pid_t pid;
4       int x = 1;
5
6       pid = Fork();
7       if (pid == 0) {  /* Child */
8           printf("child : x=%d\n", ++x);
9           exit(0);
10      }
11
12      /* Parent */
13      printf("parent: x=%d\n", --x);
14      exit(0);
15  }
code/ecf/fork.c

Figure 8.15 Using fork to create a new process.


Duplicate but separate address spaces. If we could halt both the parent and the 
child immediately after the fork function returned in each process, we 
would see that the address space of each process is identical. Each process
has the same user stack, the same local variable values, the same heap,
the same global variable values, and the same code. Thus, in our example
program, local variable x has a value of 1 in both the parent and the child
when the fork function returns in line 6. However, since the parent and
the child are separate processes, they each have their own private address
spaces. Any subsequent changes that a parent or child makes to x are
private and are not reflected in the memory of the other process. This is
why the variable x has different values in the parent and child when they
call their respective printf statements.


Shared files. When we run the example program, we notice that both parent and
child print their output on the screen. The reason is that the child inherits
all of the parent's open files. When the parent calls fork, the stdout file
is open and directed to the screen. The child inherits this file, and thus its
output is also directed to the screen.
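
The point about duplicate but separate address spaces can be checked directly. In the following sketch, which is our own illustration rather than the book's code, the child overwrites its copy of x and the parent then inspects its own copy. The function name is hypothetical, the sketch calls fork directly instead of the Fork wrapper so that it stands alone, and it uses waitpid (described in Section 8.4.3) to make sure the child has finished before the parent looks.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that overwrites its copy of x, wait for it to finish,
 * and return the parent's copy of x.
 * Returns -1 if fork or waitpid fails.
 */
int parent_copy_after_child_write(void)
{
    int x = 1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {     /* Child */
        x = 99;         /* changes only the child's private copy */
        _exit(0);       /* _exit skips flushing inherited stdio buffers */
    }

    /* Parent */
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return x;           /* the parent's copy was never modified */
}
```

The parent gets back the value it stored before the fork, no matter what the child wrote in the meantime.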


When you are first learning about the fork function, it is often helpful to
sketch the process graph, which is a simple kind of precedence graph that captures
the partial ordering of program statements. Each vertex a corresponds to the
execution of a program statement. A directed edge a → b denotes that statement
a "happens before" statement b. Edges can be labeled with information such as
the current value of a variable. Vertices corresponding to printf statements can
be labeled with the output of the printf. Each graph begins with a vertex that
corresponds to the parent process calling main. This vertex has no inedges and
exactly one outedge. The sequence of vertices for each process ends with a vertex
corresponding to a call to exit. This vertex has one inedge and no outedges.

Figure 8.16 Process graph for the example program in Figure 8.15.

int main()
{
    Fork();
    Fork();
    printf("hello\n");
    exit(0);
}

Figure 8.17 Process graph for a nested fork.

For example, Figure 8.16 shows the process graph for the example program in 
Figure 8.15. Initially, the parent sets variable x to 1. The parent calls fork, which 
creates a child process that runs concurrently with the parent in its own private
address space. 

For a program running on a single processor, any topological sort of the 
vertices in the corresponding process graph represents a feasible total ordering 
of the statements in the program. Here's a simple way to understand the idea of 
a topological sort: Given some permutation of the vertices in the process graph, 
draw the sequence of vertices in a line from left to right, and then draw each of the
directed edges. The permutation is a topological sort if and only if each edge in
the drawing goes from left to right. Thus, in our example program in Figure 8.15,
the printf statements in the parent and child can occur in either order because
each of the orderings corresponds to some topological sort of the graph vertices. 

The process graph can be especially helpful in understanding programs with 
nested fork calls. For example, Figure 8.17 shows a program with two calls to fork 
in the source code. The corresponding process graph helps us see that this program 
runs four processes, each of which makes a call to printf and which can execute 
in any order. 
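
The count of four processes can also be confirmed experimentally. The sketch below is one possible way to do so, under assumptions of our own: every process that reaches the statement after the second fork writes a single byte to a shared pipe, and the original process counts the bytes once all of its descendants have exited. The helper names are hypothetical, and the pipe, read, write, and wait functions used here are standard Unix calls that this section has not yet introduced.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Like the Fork wrapper, but uses _exit so a failing child never returns. */
static pid_t fork_or_exit(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        _exit(1);
    }
    return pid;
}

/*
 * Run the two-fork pattern of Figure 8.17 and count the processes that
 * reach the statement after the second fork. Only the original process
 * returns to the caller. Returns -1 if the pipe cannot be created.
 */
int count_processes_after_two_forks(void)
{
    int fds[2];
    int count = 0;
    char byte = 'x';
    pid_t original = getpid();

    if (pipe(fds) < 0)
        return -1;

    fork_or_exit();
    fork_or_exit();

    /* Every process created so far executes this code exactly once. */
    if (write(fds[1], &byte, 1) != 1)
        _exit(1);
    close(fds[1]);
    while (wait(NULL) > 0)      /* reap this process's own children */
        ;
    if (getpid() != original)
        _exit(0);

    /* Original process: all write ends are closed, so read reaches EOF. */
    while (read(fds[0], &byte, 1) == 1)
        count++;
    close(fds[0]);
    return count;
}
```

Each process waits for its own children before exiting, so by the time the original process starts reading, every descendant has already written its byte and closed its end of the pipe.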
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code/ecf/forkprob0.c f 8 i 
int main()
{
    int x = 1;

    if (Fork() == 0)
        printf("p1: x=%d\n", ++x);
    printf("p2: x=%d\n", --x);
    exit(0);
}
code/ecf/forkprob0.c

A. What is the output of the child process?
B. What is the output of the parent process? 


8.4.3 Reaping Child Processes

When a process terminates for any reason, the kernel does not remove it from
the system immediately. Instead, the process is kept around in a terminated state
until it is reaped by its parent. When the parent reaps the terminated child, the
kernel passes the child's exit status to the parent and then discards the terminated
process, at which point it ceases to exist. A terminated process that has not yet
been reaped is called a zombie.

When a parent process terminates, the kernel arranges for the init process
to become the adopted parent of any orphaned children. The init process, which
has a PID of 1, is created by the kernel during system start-up, never terminates,
and is the ancestor of every process. If a parent process terminates without reaping
its zombie children, then the kernel arranges for the init process to reap them.
However, long-running programs such as shells or servers should always reap their
zombie children. Even though zombies are not running, they still consume system
memory resources.

A process waits for its children to terminate or stop by calling the waitpid
function.


#include <sys/types.h>
#include <sys/wait.h>

pid_t waitpid(pid_t pid, int *statusp, int options);
Returns: PID of child if OK, 0 (if WNOHANG), or -1 on error

Aside Why are terminated children called zombies?

In folklore, a zombie is a living corpse, an entity that is half alive and half dead. A zombie process is
similar in the sense that although it has already terminated, the kernel maintains some of its state until
it can be reaped by the parent.


The waitpid function is complicated. By default (when options = 0), 
waitpid suspends execution of the calling process until a child process in its wait 
set terminates. If a process in the wait set has already terminated at the time of the 
call, then waitpid returns immediately. In either case, waitpid returns the PID of 
the terminated child that caused waitpid to return. At this point, the terminated 
child has been reaped and the kernel removes all traces of it from the system. 


Determining the Members of the Wait Set 


The members of the wait set are determined by the pid argument: 


* If pid > 0, then the wait set is the singleton child process whose process ID is
equal to pid. 
* If pid = -1, then the wait set consists of all of the parent’s child processes. 


The waitpid function also supports other kinds of wait sets, involving Unix process
groups, which we will not discuss.


Modifying the Default Behavior 


The default behavior can be modified by setting options to various combinations 
of the WNOHANG, WUNTRACED, and WCONTINUED constants: 


WNOHANG. Return immediately (with a return value of 0) if none of the 
child processes in the wait set has terminated yet. The default behavior 
suspends the calling process until a child terminates; this option is useful 
in those cases where you want to continue doing useful work while waiting 
for a child to terminate. 


WUNTRACED. Suspend execution of the calling process until a process in the 
wait set becomes either terminated or stopped. Return the PID of the 
terminated or stopped child that caused the return. The default behavior 
returns only for terminated children; this option is useful when you want 
to check for both terminated and stopped children.


WCONTINUED. Suspend execution of the calling process until a running 
process in the wait set is terminated or until a stopped process in the wait 
set has been resumed by the receipt of a SIGCONT signal. (Signals are 
explained in Section 8.5.) 


You can combine options by ORing them together. For example:

* WNOHANG | WUNTRACED: Return immediately, with a return value of
0, if none of the children in the wait set has stopped or terminated, or with a
return value equal to the PID of one of the stopped or terminated children.
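
To make the WNOHANG behavior concrete, the following sketch polls for a child instead of blocking. It is an illustration under assumptions of our own: the function name, the 20 ms nap, and the polls out-parameter are all hypothetical, and the WIFEXITED and WEXITSTATUS macros it uses are described in the next subsection.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Fork a child that naps briefly and exits with exit_code, then poll for
 * it with WNOHANG instead of blocking. If polls is non-NULL, store in
 * *polls the number of polls that found the child still running.
 * Returns the child's exit status, or -1 on error.
 */
int poll_for_child(int exit_code, int *polls)
{
    struct timespec nap = { 0, 20 * 1000 * 1000 };   /* 20 ms */
    int status, busy = 0;
    pid_t pid, ret;

    if ((pid = fork()) < 0)
        return -1;
    if (pid == 0) {                 /* Child */
        nanosleep(&nap, NULL);
        _exit(exit_code);
    }

    /* Parent: waitpid returns 0 while the child has not terminated. */
    while ((ret = waitpid(pid, &status, WNOHANG)) == 0) {
        busy++;                     /* useful work could go here */
        nanosleep(&nap, NULL);
    }
    if (polls != NULL)
        *polls = busy;
    if (ret < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

How many polls find the child still running depends on scheduling, so a caller should treat that count as informational only.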


Checking the Exit Status of a Reaped Child 


If the statusp argument is non-NULL, then waitpid encodes status information
about the child that caused the return in status, which is the value pointed to
by statusp. The wait.h include file defines several macros for interpreting the
status argument:


WIFEXITED(status). Returns true if the child terminated normally, via a call
to exit or a return. 


WEXITSTATUS(status). Returns the exit status of a normally terminated 
child. This status is only defined if WIFEXITED() returned true. 


WIFSIGNALED(status). Returns true if the child process terminated because
of a signal that was not caught.


WTERMSIG(status). Returns the number of the signal that caused the child
process to terminate. This status is only defined if WIFSIGNALED()
returned true.


WIFSTOPPED(status). Returns true if the child that caused the return is 
currently stopped. 


WSTOPSIG(status). Returns the number of the signal that caused the child 
to stop. This status is only defined if WIFSTOPPED() returned true. 


WIFCONTINUED(status). Returns true if the child process was restarted by 
receipt of a SIGCONT signal. 
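
The pairing of each "WIF" test with its companion accessor can be seen in a short sketch. The code below is illustrative only: the child_fate_t type and both function names are hypothetical, and the sketch covers only the exited, signaled, and stopped cases.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical labels for the ways a waited-for child can change state. */
typedef enum {
    CHILD_EXITED, CHILD_SIGNALED, CHILD_STOPPED, CHILD_OTHER
} child_fate_t;

/*
 * Classify a status value filled in by waitpid. On return, *detail holds
 * the exit status or signal number that goes with the classification.
 */
child_fate_t classify_status(int status, int *detail)
{
    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return CHILD_EXITED;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);
        return CHILD_SIGNALED;
    }
    if (WIFSTOPPED(status)) {
        *detail = WSTOPSIG(status);
        return CHILD_STOPPED;
    }
    *detail = 0;
    return CHILD_OTHER;
}

/*
 * Fork a child that sends itself signal sig (if sig is nonzero) and
 * otherwise exits with code. Reap it and return the raw status,
 * or -1 if fork or waitpid fails.
 */
int reap_test_child(int code, int sig)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {             /* Child */
        if (sig != 0)
            raise(sig);         /* does not return for a fatal signal */
        _exit(code);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

Note that each accessor is called only after the matching test has returned true, which is the condition under which its result is defined.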


Error Conditions 


If the calling process has no children, then waitpid returns -1 and sets errno to
ECHILD. If the waitpid function was interrupted by a signal, then it returns -1
and sets errno to EINTR.
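
A common way to handle these two conditions is to restart the call after EINTR and to treat ECHILD as the sign that nothing is left to reap. The following sketch shows one such approach. Both function names are our own, and the second function demonstrates ECHILD for a specific PID that has already been reaped rather than for a process with no children at all.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Call waitpid, restarting it whenever a signal interrupts the call.
 * Other errors, such as ECHILD, are returned to the caller with errno set.
 */
pid_t waitpid_retry(pid_t pid, int *statusp, int options)
{
    pid_t ret;

    do {
        ret = waitpid(pid, statusp, options);
    } while (ret < 0 && errno == EINTR);
    return ret;
}

/*
 * Fork one child, reap it, then wait for the same PID a second time.
 * The second call has nothing left to wait for, so it fails.
 * Returns the errno value from the second call, or -1 if any step
 * behaves unexpectedly.
 */
int errno_from_second_wait(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)               /* Child */
        _exit(0);
    if (waitpid_retry(pid, NULL, 0) != pid)
        return -1;
    if (waitpid_retry(pid, NULL, 0) != -1)
        return -1;
    return errno;
}
```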





Practice Problem 8.3

List all of the possible output sequences for the following program:

code/ecf/waitprob0.c
int main()
{
    if (Fork() == 0) {
        printf("a"); fflush(stdout);
    }
    else {
        printf("b"); fflush(stdout);
        waitpid(-1, NULL, 0);
    }
    printf("c"); fflush(stdout);
    exit(0);
}
code/ecf/waitprob0.c





The wait Function 


The wait function is a simpler version of waitpid. 


#include <sys/types.h>
#include <sys/wait.h>

pid_t wait(int *statusp);
Returns: PID of child if OK or -1 on error

Calling wait(&status) is equivalent to calling waitpid(-1, &status, 0).


Examples of Using waitpid 


Because the waitpid function is somewhat complicated, it is helpful to look at 
a few examples. Figure 8.18 shows a program that uses waitpid to wait, in no 
particular order, for all of its N children to terminate. In line 11, the parent creates 
each of the N children, and in line 12, each child exits with a unique exit status. 


Aside Constants associated with Unix functions

Constants such as WNOHANG and WUNTRACED are defined by system header files. For example,
WNOHANG and WUNTRACED are defined (indirectly) by the wait.h header file:

/* Bits in the third argument to `waitpid'. */
#define WNOHANG    1  /* Don't block waiting. */
#define WUNTRACED  2  /* Report status of stopped children. */

In order to use these constants, you must include the wait.h header file in your code:

#include <sys/wait.h>

The man page for each Unix function lists the header files to include whenever you use that function
in your code. Also, in order to check return codes such as ECHILD and EINTR, you must include
errno.h. To simplify our code examples, we include a single header file called csapp.h that includes
the header files for all of the functions used in the book. The csapp.h header file is available online
from the CS:APP Web site.


code/ecf/waitpid1.c
1   #include "csapp.h"
2   #define N 2
3
4   int main()
5   {
6       int status, i;
7       pid_t pid;
8
9       /* Parent creates N children */
10      for (i = 0; i < N; i++)
11          if ((pid = Fork()) == 0) /* Child */
12              exit(100+i);
13
14      /* Parent reaps N children in no particular order */
15      while ((pid = waitpid(-1, &status, 0)) > 0) {
16          if (WIFEXITED(status))
17              printf("child %d terminated normally with exit status=%d\n",
18                     pid, WEXITSTATUS(status));
19          else
20              printf("child %d terminated abnormally\n", pid);
21      }
22
23      /* The only normal termination is if there are no more children */
24      if (errno != ECHILD)
25          unix_error("waitpid error");
26
27      exit(0);
28  }
code/ecf/waitpid1.c

Figure 8.18 Using the waitpid function to reap zombie children in no particular order.


Before moving on, make sure you understand why line 12 is executed by each of 
the children, but not the parent. 

In line 15, the parent waits for all of its children to terminate by using waitpid
as the test condition of a while loop. Because the first argument is -1, the call to
waitpid blocks until an arbitrary child has terminated. As each child terminates, 
the call to waitpid returns with the nonzero PID of that child. Line 16 checks the 
exit status of the child. If the child terminated normally—in this case, by calling 
the exit function—then the parent extracts the exit status and prints it on stdout. 

When all of the children have been reaped, the next call to waitpid returns -1
and sets errno to ECHILD. Line 24 checks that the waitpid function terminated
normally, and prints an error message otherwise. When we run the program on
our Linux system, it produces the following output:

linux> ./waitpid1
child 22966 terminated normally with exit status=100
child 22967 terminated normally with exit status=101


Notice that the program reaps its children in no particular order. The order that 
they were reaped is a property of this specific computer system. On another 
system, or even another execution on the same system, the two children might 
have been reaped in the opposite order. This is an example of the nondeterministic 
behavior that can make reasoning about concurrency so difficult. Either of the two 
possible outcomes is equally correct, and as a programmer you may never assume 
that one outcome will always occur, no matter how unlikely the other outcome 
appears to be. The only correct assumption is that each possible outcome is equally 
likely. 

Figure 8.19 shows a simple change that eliminates this nondeterminism in the 
output order by reaping the children in the same order that they were created by 
the parent. In line 11, the parent stores the PIDs of its children in order and then 
waits for each child in this same order by calling waitpid with the appropriate
PID in the first argument. 


Practice Problem 8.4

Consider the following program:

code/ecf/waitprob1.c
int main()
{
    int status;
    pid_t pid;

    printf("Hello\n");
    pid = Fork();
    printf("%d\n", !pid);
    if (pid != 0) {
        if (waitpid(-1, &status, 0) > 0) {
            if (WIFEXITED(status) != 0)
                printf("%d\n", WEXITSTATUS(status));
        }
    }
    printf("Bye\n");
    exit(2);
}
code/ecf/waitprob1.c

A. How many output lines does this program generate?

B. What is one possible ordering of these output lines?

code/ecf/waitpid2.c
1   #include "csapp.h"
2   #define N 2
3
4   int main()
5   {
6       int status, i;
7       pid_t pid[N], retpid;
8
9       /* Parent creates N children */
10      for (i = 0; i < N; i++)
11          if ((pid[i] = Fork()) == 0) /* Child */
12              exit(100+i);
13
14      /* Parent reaps N children in order */
15      i = 0;
16      while ((retpid = waitpid(pid[i++], &status, 0)) > 0) {
17          if (WIFEXITED(status))
18              printf("child %d terminated normally with exit status=%d\n",
19                     retpid, WEXITSTATUS(status));
20          else
21              printf("child %d terminated abnormally\n", retpid);
22      }
23
24      /* The only normal termination is if there are no more children */
25      if (errno != ECHILD)
26          unix_error("waitpid error");
27
28      exit(0);
29  }
code/ecf/waitpid2.c

Figure 8.19 Using waitpid to reap zombie children in the order they were created.


8.4.4 Putting Processes to Sleep

The sleep function suspends a process for a specified period of time.

#include <unistd.h>

unsigned int sleep(unsigned int secs);
Returns: seconds left to sleep

Sleep returns zero if the requested amount of time has elapsed, and the number of
seconds still left to sleep otherwise. The latter case is possible if the sleep function
returns prematurely because it was interrupted by a signal. We will discuss signals
in detail in Section 8.5.

Another function that we will find useful is the pause function, which puts the 
calling process to sleep until the process receives a signal.


#include <unistd.h>


int pause(void);
                        Always returns -1


Practice Problem 8.5
Write a wrapper function for sleep, called snooze, with the following interface: 


unsigned int snooze(unsigned int secs); 


The snooze function behaves exactly as the sleep function, except that it prints 
a message describing how long the process actually slept: 


Slept for 4 of 5 secs. 


8.4.5 Loading and Running Programs 


The execve function loads and runs a new program in the context of the current 
process. 


#include <unistd.h>


int execve(const char *filename, const char *argv[],
           const char *envp[]);
                        Does not return if OK; returns -1 on error


The execve function loads and runs the executable object file filename with the 
argument list argv and the environment variable list envp. Execve returns to the 
calling program only if there is an error, such as not being able to find filename. 
So unlike fork, which is called once but returns twice, execve is called once and 
never returns. 

The argument list is represented by the data structure shown in Figure 8.20. 
The argv variable points to a null-terminated array of pointers, each of which 
points to an argument string. By convention, argv[0] is the name of the executable 
object file. The list of environment variables is represented by a similar data 
structure, shown in Figure 8.21. The envp variable points to a null-terminated array 
of pointers to environment variable strings, each of which is a name-value pair of 
the form name=value.
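
The following is a minimal sketch of how an argument list and an environment list might be built and handed to execve. The helper name run_with_env, the GREETING variable, and the choice of /bin/sh as the program to run are assumptions made for illustration; they are not part of the book's code. The sketch also uses the `char *const []` parameter types that the Linux <unistd.h> header declares for execve.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std modes */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run "/bin/sh -c cmd" with a one-entry environment.
 * Returns the child's exit status, or -1 if anything fails. */
int run_with_env(const char *cmd)
{
    /* argv: null-terminated array; argv[0] is the program name by convention */
    char *argv[] = { "sh", "-c", (char *)cmd, NULL };
    /* envp: null-terminated array of name=value strings */
    char *envp[] = { "GREETING=hello", NULL };
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve("/bin/sh", argv, envp);
        _exit(127);              /* reached only if execve failed */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because execve replaces the child's program, the line after the call runs only when the call fails. The parent learns the outcome through waitpid; for example, run_with_env("exit 7") returns 7.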

Figure 8.20 Organization of an argument list.


Figure 8.21 Organization of an environment variable list.






After execve loads filename, it calls the start-up code described in Section 7.9. 
The start-up code sets up the stack and passes control to the main routine 
of the new program, which has a prototype of the form 


int main(int argc, char **argv, char **envp); 
or equivalently, 
int main(int argc, char *argv[], char *envp[]);


When main begins executing, the user stack has the organization shown in Figure 8.22. 
Let's work our way from the bottom of the stack (the highest address) 
to the top (the lowest address). First are the argument and environment strings. 
These are followed further up the stack by a null-terminated array of pointers, 
each of which points to an environment variable string on the stack. The global 
variable environ points to the first of these pointers, envp[0]. The environment 
array is followed by the null-terminated argv[] array, with each element pointing 
to an argument string on the stack. At the top of the stack is the stack frame for 
the system start-up function, libc_start_main (Section 7.9).

There are three arguments to function main, each stored in a register according 
to the x86-64 stack discipline: (1) argc, which gives the number of non-null 
pointers in the argv[] array; (2) argv, which points to the first entry in the argv[] 
array; and (3) envp, which points to the first entry in the envp[] array.

Linux provides several functions for manipulating the environment array:


#include <stdlib.h> 


char *getenv(const char *name); 
                        Returns: pointer to value if name exists, NULL if no match


The getenv function searches the environment array for a string name=value. If 
found, it returns a pointer to value; otherwise, it returns NULL. 


#include <stdlib.h> 


int setenv(const char *name, const char *newvalue, int overwrite); 
                        Returns: 0 on success, -1 on error


int unsetenv(const char *name);
                        Returns: 0 on success, -1 on error


If the environment array contains a string of the form name=oldvalue, then 
unsetenv deletes it and setenv replaces oldvalue with newvalue, but only if 
overwrite is nonzero. If name does not exist, then setenv adds name=newvalue 
to the array.
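
The behavior described above can be checked with a short sketch. The function name env_demo and the variable name CSAPP_DEMO are hypothetical, chosen only for illustration; the sketch assumes nothing else in the process depends on that variable.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std modes */
#include <stdlib.h>
#include <string.h>

/* Hypothetical demo: returns 1 if getenv, setenv, and unsetenv behave as
 * described in the text, and 0 otherwise. */
int env_demo(void)
{
    const char *v;

    unsetenv("CSAPP_DEMO");                  /* start from a clean state */
    if (getenv("CSAPP_DEMO") != NULL)
        return 0;

    setenv("CSAPP_DEMO", "old", 1);          /* name absent: adds CSAPP_DEMO=old */
    setenv("CSAPP_DEMO", "new", 0);          /* overwrite == 0: value is kept */
    v = getenv("CSAPP_DEMO");
    if (v == NULL || strcmp(v, "old") != 0)
        return 0;

    setenv("CSAPP_DEMO", "new", 1);          /* overwrite != 0: value is replaced */
    v = getenv("CSAPP_DEMO");
    if (v == NULL || strcmp(v, "new") != 0)
        return 0;

    unsetenv("CSAPP_DEMO");                  /* deletes the entry */
    return getenv("CSAPP_DEMO") == NULL;
}
```

Note that getenv returns a pointer to the value part of the name=value string, so the comparisons above are against "old" and "new" rather than against the full entry.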










Practice Problem 8.6


Write a program called myecho that prints its command-line arguments and 
environment variables. For example:


linux> ./myecho arg1 arg2
Command-line arguments:
    argv[ 0]: myecho
    argv[ 1]: arg1
    argv[ 2]: arg2
Environment variables:
    envp[ 0]: PWD=/usr0/droh/ics/code/ecf
    envp[ 1]: TERM=emacs
    ...
    envp[25]: USER=droh
    envp[26]: SHELL=/usr/local/bin/tcsh
    envp[27]: HOME=/usr0/droh





8.4.6 Using fork and execve to Run Programs 


Programs such as Unix shells and Web servers make heavy use of the fork and 
execve functions. A shell is an interactive application-level program that runs 
other programs on behalf of the user. The original shell was the sh program, 
which was followed by variants such as csh, tcsh, ksh, and bash. A shell performs 
a sequence of read/evaluate steps and then terminates. The read step reads a 
command line from the user. The evaluate step parses the command line and runs 
programs on behalf of the user. 

Figure 8.23 shows the main routine of a simple shell. The shell prints a 
command-line prompt, waits for the user to type a command line on stdin, and 
then evaluates the command line. 

Figure 8.24 shows the code that evaluates the command line. Its first task is 
to call the parseline function (Figure 8.25), which parses the space-separated 
command-line arguments and builds the argv vector that will eventually be passed 
to execve. The first argument is assumed to be either the name of a built-in shell 
command that is interpreted immediately, or an executable object file that will be 
loaded and run in the context of a new child process. 

If the last argument is an '&' character, then parseline returns 1, indicating 
that the program should be executed in the background (the shell does not wait 
for it to complete). Otherwise, it returns 0, indicating that the program should be 
run in the foreground (the shell waits for it to complete). 


Aside  Programs versus processes


This is a good place to pause and make sure you understand the distinction between a program and 
a process. A program is a collection of code and data; programs can exist as object files on disk or as 
segments in an address space. A process is a specific instance of a program in execution; a program 
always runs in the context of some process. Understanding this distinction is important if you want to 
understand the fork and execve functions. The fork function runs the same program in a new child 
process that is a duplicate of the parent. The execve function loads and runs a new program in the 
context of the current process. While it overwrites the address space of the current process, it does not 
create a new process. The new program still has the same PID, and it inherits all of the file descriptors 
that were open at the time of the call to the execve function.

code/ecf/shellex.c

#include "csapp.h"
#define MAXARGS 128

/* Function prototypes */
void eval(char *cmdline);
int parseline(char *buf, char **argv);
int builtin_command(char **argv);

int main()
{
    char cmdline[MAXLINE]; /* Command line */

    while (1) {
        /* Read */
        printf("> ");
        Fgets(cmdline, MAXLINE, stdin);
        if (feof(stdin))
            exit(0);

        /* Evaluate */
        eval(cmdline);
    }
}

code/ecf/shellex.c



Figure 8.23 The main routine for a simple shell program. 


After parsing the command line, the eval function calls the builtin_command 
function, which checks whether the first command-line argument is a built-in shell 
command. If so, it interprets the command immediately and returns 1. Otherwise, 
it returns 0. Our simple shell has just one built-in command, the quit command, 
which terminates the shell. Real shells have numerous commands, such as pwd, 
jobs, and fg. 
If builtin_command returns 0, then the shell creates a child process and 
executes the requested program inside the child. If the user has asked for the 
program to run in the background, then the shell returns to the top of the loop and 
waits for the next command line. Otherwise the shell uses the waitpid function 
to wait for the job to terminate. When the job terminates, the shell goes on to the 
next iteration. 

Notice that this simple shell is flawed because it does not reap any of its 
background children. Correcting this flaw requires the use of signals, which we 
describe in the next section. 



code/ecf/shellex.c

/* eval - Evaluate a command line */
void eval(char *cmdline)
{
    char *argv[MAXARGS]; /* Argument list execve() */
    char buf[MAXLINE];   /* Holds modified command line */
    int bg;              /* Should the job run in bg or fg? */
    pid_t pid;           /* Process id */

    strcpy(buf, cmdline);
    bg = parseline(buf, argv);
    if (argv[0] == NULL)
        return;   /* Ignore empty lines */

    if (!builtin_command(argv)) {
        if ((pid = Fork()) == 0) {   /* Child runs user job */
            if (execve(argv[0], argv, environ) < 0) {
                printf("%s: Command not found.\n", argv[0]);
                exit(0);
            }
        }

        /* Parent waits for foreground job to terminate */
        if (!bg) {
            int status;
            if (waitpid(pid, &status, 0) < 0)
                unix_error("waitfg: waitpid error");
        }
        else
            printf("%d %s", pid, cmdline);
    }
    return;
}

/* If first arg is a builtin command, run it and return true */
int builtin_command(char **argv)
{
    if (!strcmp(argv[0], "quit")) /* quit command */
        exit(0);
    if (!strcmp(argv[0], "&"))    /* Ignore singleton & */
        return 1;
    return 0;                     /* Not a builtin command */
}

code/ecf/shellex.c



Figure 8.24 eval evaluates the shell command line. 

code/ecf/shellex.c

/* parseline - Parse the command line and build the argv array */
int parseline(char *buf, char **argv)
{
    char *delim;         /* Points to first space delimiter */
    int argc;            /* Number of args */
    int bg;              /* Background job? */

    buf[strlen(buf)-1] = ' ';  /* Replace trailing '\n' with space */
    while (*buf && (*buf == ' ')) /* Ignore leading spaces */
        buf++;

    /* Build the argv list */
    argc = 0;
    while ((delim = strchr(buf, ' '))) {
        argv[argc++] = buf;
        *delim = '\0';
        buf = delim + 1;
        while (*buf && (*buf == ' ')) /* Ignore spaces */
            buf++;
    }
    argv[argc] = NULL;

    if (argc == 0)  /* Ignore blank line */
        return 1;

    /* Should the job run in the background? */
    if ((bg = (*argv[argc-1] == '&')) != 0)
        argv[--argc] = NULL;

    return bg;
}

code/ecf/shellex.c


Figure 8.25 parseline parses a line of input for the shell. 


8.5 Signals 


To this point in our study of exceptional control flow, we have seen how hardware 
and software cooperate to provide the fundamental low-level exception mecha- 
nism. We have also seen how the operating system uses exceptions to support a 
form of exceptional control flow known as the process context switch. In this section, 
we will study a higher-level software form of exceptional control flow, known 
as a Linux signal, that allows processes and the kernel to interrupt other processes. 





Number  Name        Default action                 Corresponding event

1       SIGHUP      Terminate                      Terminal line hangup
2       SIGINT      Terminate                      Interrupt from keyboard
3       SIGQUIT     Terminate                      Quit from keyboard
4       SIGILL      Terminate                      Illegal instruction
5       SIGTRAP     Terminate and dump core (a)    Trace trap
6       SIGABRT     Terminate and dump core (a)    Abort signal from abort function
7       SIGBUS      Terminate                      Bus error
8       SIGFPE      Terminate and dump core (a)    Floating-point exception
9       SIGKILL     Terminate (b)                  Kill program
10      SIGUSR1     Terminate                      User-defined signal 1
11      SIGSEGV     Terminate and dump core (a)    Invalid memory reference (seg fault)
12      SIGUSR2     Terminate                      User-defined signal 2
13      SIGPIPE     Terminate                      Wrote to a pipe with no reader
14      SIGALRM     Terminate                      Timer signal from alarm function
15      SIGTERM     Terminate                      Software termination signal
16      SIGSTKFLT   Terminate                      Stack fault on coprocessor
17      SIGCHLD     Ignore                         A child process has stopped or terminated
18      SIGCONT     Ignore                         Continue process if stopped
19      SIGSTOP     Stop until next SIGCONT (b)    Stop signal not from terminal
20      SIGTSTP     Stop until next SIGCONT        Stop signal from terminal
21      SIGTTIN     Stop until next SIGCONT        Background process read from terminal
22      SIGTTOU     Stop until next SIGCONT        Background process wrote to terminal
23      SIGURG      Ignore                         Urgent condition on socket
24      SIGXCPU     Terminate                      CPU time limit exceeded
25      SIGXFSZ     Terminate                      File size limit exceeded
26      SIGVTALRM   Terminate                      Virtual timer expired
27      SIGPROF     Terminate                      Profiling timer expired
28      SIGWINCH    Ignore                         Window size changed
29      SIGIO       Terminate                      I/O now possible on a descriptor
30      SIGPWR      Terminate                      Power failure


Figure 8.26 Linux signals. Notes: (a) Years ago, main memory was implemented with a technology known 
as core memory. "Dumping core" is a historical term that means writing an image of the code and data 
memory segments to disk. (b) This signal can be neither caught nor ignored. (Source: man 7 signal. Data 
from the Linux Foundation.) 


A signal is a small message that notifies a process that an event of some type 
has occurred in the system. Figure 8.26 shows the 30 different types of signals that 
are supported on Linux systems. 

Each signal type corresponds to some kind of system event. Low-level hardware 
exceptions are processed by the kernel's exception handlers and would not 
normally be visible to user processes. Signals provide a mechanism for exposing 
the occurrence of such exceptions to user processes. For example, if a process 
attempts to divide by zero, then the kernel sends it a SIGFPE signal (number 8). 
If a process executes an illegal instruction, the kernel sends it a SIGILL signal 
(number 4). If a process makes an illegal memory reference, the kernel sends it a 
SIGSEGV signal (number 11). Other signals correspond to higher-level software 
events in the kernel or in other user processes. For example, if you type Ctrl+C 
(i.e., press the Ctrl key and the ‘c’ key at the same time) while a process is running 
in the foreground, then the kernel sends a SIGINT (number 2) to each process in 
the foreground process group. A process can forcibly terminate another process 
by sending it a SIGKILL signal (number 9). When a child process terminates or 
stops, the kernel sends a SIGCHLD signal (number 17) to the parent. 













8.5.1 Signal Terminology 


The transfer of a signal to a destination process occurs in two distinct steps: 







Sending a signal. The kernel sends (delivers) a signal to a destination process by 
updating some state in the context of the destination process. The signal 
is delivered for one of two reasons: (1) The kernel has detected a system 
event such as a divide-by-zero error or the termination of a child process. 
(2) A process has invoked the kill function (discussed in the next section) 
to explicitly request the kernel to send a signal to the destination process. 
A process can send a signal to itself. 


Receiving a signal. A destination process receives a signal when it is forced by 
the kernel to react in some way to the delivery of the signal. The process 
can either ignore the signal, terminate, or catch the signal by executing 
a user-level function called a signal handler. Figure 8.27 shows the basic 
idea of a handler catching a signal. 


















A signal that has been sent but not yet received is called a pending signal. At 
any point in time, there can be at most one pending signal of a particular type. 
If a process has a pending signal of type k, then any subsequent signals of type 
k sent to that process are not queued; they are simply discarded. A process can 
selectively block the receipt of certain signals. When a signal is blocked, it can be 
delivered, but the resulting pending signal will not be received until the process 
unblocks the signal. 


Figure 8.27 Signal handling. Receipt of a signal triggers a control transfer to a signal 
handler. After it finishes processing, the handler returns control to the interrupted program.

A pending signal is received at most once. For each process, the kernel maintains 
the set of pending signals in the pending bit vector, and the set of blocked 
signals in the blocked bit vector.¹ The kernel sets bit k in pending whenever a 
signal of type k is delivered and clears bit k in pending whenever a signal of type 
k is received. 

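
The bookkeeping described above can be modeled with two machine words. The following is a toy sketch, not kernel code: the type sigstate_t and the functions deliver and receive are hypothetical names, and picking the smallest ready signal number is an assumption made to keep the model deterministic.

```c
#include <stdint.h>

/* Toy model of the per-process signal state; not kernel code. */
typedef struct {
    uint32_t pending;   /* bit k set: a signal of type k is pending */
    uint32_t blocked;   /* bit k set: signals of type k are blocked */
} sigstate_t;

/* Deliver signal k (1..31): sets bit k in pending. A second delivery while
 * the bit is already set changes nothing, which models the rule that
 * repeated signals of one type are not queued but discarded. */
void deliver(sigstate_t *s, int k)
{
    s->pending |= (UINT32_C(1) << k);
}

/* Receive one pending signal that is not blocked: returns the smallest such
 * k and clears its pending bit, or returns 0 if none can be received. */
int receive(sigstate_t *s)
{
    uint32_t ready = s->pending & ~s->blocked;

    for (int k = 1; k < 32; k++) {
        if (ready & (UINT32_C(1) << k)) {
            s->pending &= ~(UINT32_C(1) << k);
            return k;
        }
    }
    return 0;
}
```

In this model, delivering signal 17 twice and then calling receive twice yields 17 and then 0: the second delivery left no trace. A blocked signal stays in pending until its bit in blocked is cleared.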
8.5.2 Sending Signals 


Unix systems provide a number of mechanisms for sending signals to processes. 
All of the mechanisms rely on the notion of a process group. 


Process Groups 


Every process belongs to exactly one process group, which is identified by a 
positive integer process group ID. The getpgrp function returns the process group 
ID of the current process. 


#include <unistd.h> 


pid_t getpgrp(void); 
                        Returns: process group ID of calling process 


By default, a child process belongs to the same process group as its parent. A 
process can change the process group of itself or another process by using the 
setpgid function: 


#include <unistd.h> 


int setpgid(pid_t pid, pid_t pgid); 
                        Returns: 0 on success, -1 on error 


The setpgid function changes the process group of process pid to pgid. If pid is 
zero, the PID of the current process is used. If pgid is zero, the PID of the process 
specified by pid is used for the process group ID. For example, if process 15213 is 
the calling process, then 


setpgid(0, 0); 


creates a new process group whose process group ID is 15213, and adds process 
15213 to this new group. 
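
As a small sketch of the setpgid(0, 0) case, the hypothetical function below (child_leads_new_group is not from the book) forks a child, lets the child move itself into a new process group, and reports back through the exit status whether the child's process group ID now equals its own PID.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std modes */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if a child that calls setpgid(0, 0) ends up
 * in a process group whose ID equals its own PID, 0 if not, -1 on error. */
int child_leads_new_group(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Until this call, the child is in its parent's process group */
        if (setpgid(0, 0) < 0)
            _exit(2);
        _exit(getpgrp() == getpid() ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

A shell can use the same call in each child it forks so that every job gets its own process group.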


1. Also known as the signal mask. 

Figure 8.28 Foreground and background process groups.


Sending Signals with the /bin/kill Program 


The /bin/kill program sends an arbitrary signal to another process. For example, 
the command 


linux> /bin/kill -9 15213 


sends signal 9 (SIGKILL) to process 15213. A negative PID causes the signal to 
be sent to every process in process group PID. For example, the command 


linux> /bin/kill -9 -15213 


sends a SIGKILL signal to every process in process group 15213. Note that we 
use the complete path /bin/kill here because some Unix shells have their own 
built-in kill command. 


Sending Signals from the Keyboard 


Unix shells use the abstraction of a job to represent the processes that are created 
as a result of evaluating a single command line. At any point in time, there is at 
most one foreground job and zero or more background jobs. For example, typing 


linux> ls | sort 


creates a foreground job consisting of two processes connected by a Unix pipe: one 
running the ls program, the other running the sort program. The shell creates 
a separate process group for each job. Typically, the process group ID is taken 
from one of the parent processes in the job. For example, Figure 8.28 shows a 
shell with one foreground job and two background jobs. The parent process in the 
foreground job has a PID of 20 and a process group ID of 20. The parent process 
has created two children, each of which is also a member of process group 20. 


Labels shown in Figure 8.28: foreground process group 20; background process groups 32 and 40.

Typing Ctrl+C at the keyboard causes the kernel to send a SIGINT signal to 
every process in the foreground process group. In the default case, the result is to 
terminate the foreground job. Similarly, typing Ctrl+Z causes the kernel to send a 
SIGTSTP signal to every process in the foreground process group. In the default 
case, the result is to stop (suspend) the foreground job. 


Sending Signals with the kill Function 


Processes send signals to other processes (including themselves) by calling the 
kill function. 


#include <sys/types.h> 
#include <signal.h> 


int kill(pid_t pid, int sig); 
                        Returns: 0 if OK, -1 on error 


If pid is greater than zero, then the kill function sends signal number sig to 
process pid. If pid is equal to zero, then kill sends signal sig to every process 
in the process group of the calling process, including the calling process itself. If 
pid is less than zero, then kill sends signal sig to every process in process group 
|pid| (the absolute value of pid). Figure 8.29 shows an example of a parent that 
uses the kill function to send a SIGKILL signal to its child. 


code/ecf/kill.c

#include "csapp.h"

int main()
{
    pid_t pid;

    /* Child sleeps until SIGKILL signal received, then dies */
    if ((pid = Fork()) == 0) {
        Pause();  /* Wait for a signal to arrive */
        printf("control should never reach here!\n");
        exit(0);
    }

    /* Parent sends a SIGKILL signal to a child */
    Kill(pid, SIGKILL);
    exit(0);
}

code/ecf/kill.c

Figure 8.29 Using the kill function to send a signal to a child.
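
A variation on this figure, sketched under a few assumptions, lets the parent confirm how the child died. The helper name kill_child_with is hypothetical, the sketch calls the raw system functions rather than the csapp.h wrappers, and it is meant only for signals whose default action terminates the process (a signal that the child ignores would leave it paused forever).

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std modes */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that pauses, send it sig, and return the
 * number of the signal that terminated it (-1 on any failure). */
int kill_child_with(int sig)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        pause();        /* Wait for a signal to arrive */
        _exit(0);       /* Not reached if the signal terminates the child */
    }
    if (kill(pid, sig) < 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0 || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```

Calling kill_child_with(SIGKILL) returns SIGKILL. The result does not depend on whether the child has reached pause before the signal arrives, because SIGKILL terminates the child in either case.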







Sending Signals with the alarm Function 


A process can send SIGALRM signals to itself by calling the alarm function. 


#include <unistd.h> 


unsigned int alarm(unsigned int secs); 
Returns: remaining seconds of previous alarm, or 0 if no previous alarm 




The alarm function arranges for the kernel to send a SIGALRM signal to the 
calling process in secs seconds. If secs is 0, then no new alarm is scheduled. In 
any event, the call to alarm cancels any pending alarms and returns the number 
of seconds remaining until any pending alarm was due to be delivered (had not 
this call to alarm canceled it), or 0 if there were no pending alarms. 
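
The return value of alarm can be seen without waiting for any signal to arrive. The following is a minimal sketch; the function name schedule_then_cancel is hypothetical, and the sketch assumes the caller does not rely on an alarm that was already pending, since it cancels one.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std modes */
#include <unistd.h>

/* Hypothetical demo: schedule an alarm, then cancel it before it fires.
 * Returns the number of seconds that were left on the canceled alarm. */
unsigned int schedule_then_cancel(unsigned int secs)
{
    alarm(0);             /* Cancel any alarm left over from earlier code */
    alarm(secs);          /* Arrange for SIGALRM in secs seconds */
    return alarm(0);      /* secs == 0: schedules nothing, cancels the pending alarm */
}
```

Since the cancellation happens right away, the value returned for secs = 5 is expected to be 5, or 4 if a second boundary happens to pass between the two calls. For secs = 0 it is 0, because no alarm was ever scheduled.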


8.5.3 Receiving Signals 


When the kernel switches a process p from kernel mode to user mode (e.g., 
returning from a system call or completing a context switch), it checks the set of 
unblocked pending signals (pending & ~blocked) for p. If this set is empty (the 
usual case), then the kernel passes control to the next instruction (I_next) in the 
logical control flow of p. However, if the set is nonempty, then the kernel chooses 
some signal k in the set (typically the smallest k) and forces p to receive signal 
k. The receipt of the signal triggers some action by the process. Once the process 
completes the action, then control passes back to the next instruction (I_next) in the 
logical control flow of p. Each signal type has a predefined default action, which 
is one of the following: 


* The process terminates. 
* The process terminates and dumps core. 
* The process stops (suspends) until restarted by a SIGCONT signal. 
* The process ignores the signal. 


Figure 8.26 shows the default actions associated with each type of signal. 
For example, the default action for the receipt of a SIGKILL is to terminate 
the receiving process. On the other hand, the default action for the receipt of 
a SIGCHLD is to ignore the signal. A process can modify the default action 
associated with a signal by using the signal function. The only exceptions are 
SIGSTOP and SIGKILL, whose default actions cannot be changed. 


#include <signal.h> 
typedef void (*sighandler_t) (int); 


sighandler_t signal(int signum, sighandler_t handler); 
                        Returns: pointer to previous handler if OK, SIG_ERR on error (does not set errno) 


The signal function can change the action associated with a signal signum in 
one of three ways: 


* If handler is SIG_IGN, then signals of type signum are ignored. 


* If handler is SIG_DFL, then the action for signals of type signum reverts to 
the default action. 


* Otherwise, handler is the address of a user-defined function, called a signal 
handler, that will be called whenever the process receives a signal of type 
signum. Changing the default action by passing the address of a handler to 
the signal function is known as installing the handler. The invocation of the 
handler is called catching the signal. The execution of the handler is referred 
to as handling the signal. 


When a process catches a signal of type k, the handler installed for signal k is 
invoked with a single integer argument set to k. This argument allows the same 
handler function to catch different types of signals. 
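
To illustrate one handler serving several signal types, here is a minimal sketch. The names last_sig, record_handler, and raise_and_record are hypothetical. The sketch sends the signal with raise, a standard C function not covered in this section that sends a signal to the calling process, and it assumes SIGUSR1 and SIGUSR2 are not blocked when it runs.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std modes */
#include <signal.h>

/* Written by the handler, read by the main program */
static volatile sig_atomic_t last_sig = 0;

/* One handler shared by several signal types; sig tells it which one arrived */
static void record_handler(int sig)
{
    last_sig = sig;
}

/* Hypothetical helper: install record_handler for SIGUSR1 and SIGUSR2, send
 * sig to the calling process, and return the number the handler recorded. */
int raise_and_record(int sig)
{
    if (signal(SIGUSR1, record_handler) == SIG_ERR ||
        signal(SIGUSR2, record_handler) == SIG_ERR)
        return -1;
    last_sig = 0;
    if (raise(sig) != 0)
        return -1;
    return last_sig;   /* the handler has already run when raise returns */
}
```

The handler only stores its argument in a variable of type volatile sig_atomic_t and returns, which keeps it simple enough to reason about.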

When the handler executes its return statement, control (usually) passes back 
to the instruction in the control flow where the process was interrupted by the 
receipt of the signal. We say “usually” because in some systems, interrupted system 
calls return immediately with an error. 

Figure 8.30 shows a program that catches the SIGINT signal that is sent 
whenever the user types Ctrl+C at the keyboard. The default action for SIGINT 


code/ecf/sigint.c

#include "csapp.h"

void sigint_handler(int sig) /* SIGINT handler */
{
    printf("Caught SIGINT!\n");
    exit(0);
}

int main()
{
    /* Install the SIGINT handler */
    if (signal(SIGINT, sigint_handler) == SIG_ERR)
        unix_error("signal error");

    pause(); /* Wait for the receipt of a signal */

    return 0;
}

code/ecf/sigint.c


Figure 8.30 A program that uses a signal handler to catch a SIGINT signal. 





Figure 8.31 Handlers can be interrupted by other handlers. Steps shown in the figure: 
(1) the program catches signal s; (2) control passes to handler S; (3) the program 
catches signal t; (4) control passes to handler T; (5) handler T returns to handler S; 
(6) handler S returns to the main program; (7) the main program resumes. 


is to immediately terminate the process. In this example, we modify the default 
behavior to catch the signal, print a message, and then terminate the process. 

Signal handlers can be interrupted by other handlers, as shown in Figure 8.31. 
In this example, the main program catches signal s, which interrupts the main 
program and transfers control to handler S. While S is running, the program 
catches signal t ≠ s, which interrupts S and transfers control to handler T. When 
T returns, S resumes where it was interrupted. Eventually, S returns, transferring 
control back to the main program, which resumes where it left off. 


Practice Problem 8.7
Write a program called snooze that takes a single command-line argument, calls 
the snooze function from Problem 8.5 with this argument, and then terminates. 
Write your program so that the user can interrupt the snooze function by typing 
Ctrl+C at the keyboard. For example: 


linux> ./snooze 5 
CTRL+C                     User hits Ctrl+C after 3 seconds 
Slept for 3 of 5 secs. 
linux> 





8.5.4 Blocking and Unblocking Signals 


Linux provides implicit and explicit mechanisms for blocking signals: 


Implicit blocking mechanism. By default, the kernel blocks any pending signals 
of the type currently being processed by a handler. For example, in 
Figure 8.31, suppose the program has caught signal s and is currently run- 
ning handler S. If another signal s is sent to the process, then s will become 
pending but will not be received until after handler S returns. 


Explicit blocking mechanism. Applications can explicitly block and unblock 
selected signals using the sigprocmask function and its helpers. 

#include <signal.h> 


int sigprocmask(int how, const sigset_t *set, sigset_t *oldset); 
int sigemptyset(sigset_t *set); 
int sigfillset(sigset_t *set); 
int sigaddset(sigset_t *set, int signum); 
int sigdelset(sigset_t *set, int signum); 
                        Returns: 0 if OK, -1 on error 


int sigismember(const sigset_t *set, int signum); 
                        Returns: 1 if member, 0 if not, -1 on error 


The sigprocmask function changes the set of currently blocked signals (the 
blocked bit vector described in Section 8.5.1). The specific behavior depends on 
the value of how: 

SIG_BLOCK. Add the signals in set to blocked (blocked = blocked | set). 


SIG_UNBLOCK. Remove the signals in set from blocked (blocked = 
blocked & ~set). 

SIG_SETMASK. blocked = set. 


If oldset is non-NULL, the previous value of the blocked bit vector is stored in 
oldset. 

Signal sets such as set are manipulated using the following functions: The 
sigemptyset function initializes set to the empty set. The sigfillset function adds every 
signal to set. The sigaddset function adds signum to set, sigdelset deletes 
signum from set, and sigismember returns 1 if signum is a member of set, and 
0 if not. 

For example, Figure 8.32 shows how you would use sigprocmask to temporarily 
block the receipt of SIGINT signals. 


sigset_t mask, prev_mask;

Sigemptyset(&mask);
Sigaddset(&mask, SIGINT);

/* Block SIGINT and save previous blocked set */
Sigprocmask(SIG_BLOCK, &mask, &prev_mask);

    /* ... Code region that will not be interrupted by SIGINT ... */

/* Restore previous blocked set, unblocking SIGINT */
Sigprocmask(SIG_SETMASK, &prev_mask, NULL);



Figure 8.32 Temporarily blocking a signal from being received. 
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8.5.5 Writing Signal Handlers 


Signal handling is one of the thornier aspects of Linux system-level programming. 
Handlers have several attributes that make them difficult to reason about: (1) Han- 
dlers run concurrently with the main program and share the same global variables, 
and thus can interfere with the main program and with other handlers. (2) The 
rules for how and when signals are received are often counterintuitive. (3) Different 
systems can have different signal-handling semantics. 

In this section, we address these issues and give you some basic guidelines for 
writing safe, correct, and portable signal handlers. 


Safe Signal Handling 


Signal handlers are tricky because they can run concurrently with the main pro- 
gram and with each other, as we saw in Figure 8.31. If a handler and the main 
program access the same global data structure concurrently, then the results can 
be unpredictable and often fatal. 

We will explore concurrent programming in detail in Chapter 12. Our aim 
here is to give you some conservative guidelines for writing handlers that are 
safe to run concurrently. If you ignore these guidelines, you run the risk of 
introducing subtle concurrency errors. With such errors, your program works 
correctly most of the time. However, when it fails, it fails in unpredictable and 
unrepeatable ways that are horrendously difficult to debug. Forewarned is 
forearmed! 


G0. Keep handlers as simple as possible. The best way to avoid trouble is to keep 
your handlers as small and simple as possible. For example, the handler 
might simply set a global flag and return immediately; all processing 
associated with the receipt of the signal is performed by the main program, 
which periodically checks (and resets) the flag. 


G1. Call only async-signal-safe functions in your handlers. A function that is
async-signal-safe, or simply safe, has the property that it can be safely
called from a signal handler, either because it is reentrant (e.g., accesses
only local variables; see Section 12.7.2), or because it cannot be
interrupted by a signal handler. Figure 8.33 lists the system-level
functions that Linux guarantees to be safe. Notice that many popular
functions, such as printf, sprintf, malloc, and exit, are not on this list.

The only safe way to generate output from a signal handler is to use
the write function (see Section 10.1). In particular, calling printf or
sprintf is unsafe. To work around this unfortunate restriction, we have
developed some safe functions, called the Sio (Safe I/O) package, that
you can use to print simple messages from signal handlers.
_Exit, _exit, abort, accept, access, aio_error, aio_return, aio_suspend,
alarm, bind, cfgetispeed, cfgetospeed, cfsetispeed, cfsetospeed, chdir,
chmod, chown, clock_gettime, close, connect, creat, dup, dup2, execl,
execle, execv, execve, faccessat, fchmod, fchmodat, fchown, fchownat,
fcntl, fdatasync, fexecve, fork, fstat, fstatat, fsync, ftruncate,
futimens, getegid, geteuid, getgid, getgroups, getpeername, getpgrp,
getpid, getppid, getsockname, getsockopt, getuid, kill, link, linkat,
listen, lseek, lstat, mkdir, mkdirat, mkfifo, mkfifoat, mknod, mknodat,
open, openat, pause, pipe, poll, posix_trace_event, pselect, raise, read,
readlink, readlinkat, recv, recvfrom, recvmsg, rename, renameat, rmdir,
select, sem_post, send, sendmsg, sendto, setgid, setpgid, setsid,
setsockopt, setuid, shutdown, sigaction, sigaddset, sigdelset,
sigemptyset, sigfillset, sigismember, signal, sigpause, sigpending,
sigprocmask, sigqueue, sigset, sigsuspend, sleep, sockatmark, socket,
socketpair, stat, symlink, symlinkat, tcdrain, tcflow, tcflush, tcgetattr,
tcgetpgrp, tcsendbreak, tcsetattr, tcsetpgrp, time, timer_getoverrun,
timer_gettime, timer_settime, times, umask, uname, unlink, unlinkat,
utime, utimensat, utimes, wait, waitpid, write

Figure 8.33 Async-signal-safe functions. (Source: man 7 signal. Data from the Linux
Foundation.)







768 Chapter 8 Exceptional Control Flow 


#include "csapp.h"

ssize_t sio_putl(long v);
ssize_t sio_puts(char s[]);

Returns: number of bytes transferred if OK, -1 on error

void sio_error(char s[]);

Returns: nothing

The sio_putl and sio_puts functions emit a long and a string, respectively,
to standard output. The sio_error function prints an error message
and terminates.

Figure 8.34 shows the implementation of the Sio package, which uses
two private reentrant functions from csapp.c. The sio_strlen function
in line 3 returns the length of string s. The sio_ltoa function in line 10,
which is based on the itoa function from [61], converts v to its base b
string representation in s. The _exit function in line 17 is an
async-signal-safe variant of exit.

Figure 8.35 shows a safe version of the SIGINT handler from Figure 8.30.


G2. Save and restore errno. Many of the Linux async-signal-safe functions set
errno when they return with an error. Calling such functions inside a
handler might interfere with other parts of the program that rely on errno.


code/src/csapp.c
 1  ssize_t sio_puts(char s[]) /* Put string */
 2  {
 3      return write(STDOUT_FILENO, s, sio_strlen(s));
 4  }
 5
 6  ssize_t sio_putl(long v) /* Put long */
 7  {
 8      char s[128];
 9
10      sio_ltoa(v, s, 10); /* Based on K&R itoa() */
11      return sio_puts(s);
12  }
13
14  void sio_error(char s[]) /* Put error message and exit */
15  {
16      sio_puts(s);
17      _exit(1);
18  }
code/src/csapp.c

Figure 8.34 The Sio (Safe I/O) package for signal handlers.
code/ecf/sigintsafe.c
#include "csapp.h"

void sigint_handler(int sig) /* Safe SIGINT handler */
{
    Sio_puts("Caught SIGINT!\n"); /* Safe output */
    _exit(0);                      /* Safe exit */
}
code/ecf/sigintsafe.c

Figure 8.35 A safe version of the SIGINT handler from Figure 8.30.


The workaround is to save errno to a local variable on entry to the handler
and restore it before the handler returns. Note that this is only necessary
if the handler returns. It is not necessary if the handler terminates the
process by calling _exit.


G3. Protect accesses to shared global data structures by blocking all signals. If
a handler shares a global data structure with the main program or with
other handlers, then your handlers and main program should temporarily
block all signals while accessing (reading or writing) that data structure.
The reason for this rule is that accessing a data structure d from the main
program typically requires a sequence of instructions. If this instruction
sequence is interrupted by a handler that accesses d, then the handler
might find d in an inconsistent state, with unpredictable results.
Temporarily blocking signals while you access d guarantees that a handler will
not interrupt the instruction sequence.


G4. Declare global variables with volatile. Consider a handler and main
routine that share a global variable g. The handler updates g, and main
periodically reads g. To an optimizing compiler, it would appear that the
value of g never changes in main, and thus it would be safe to use a copy
of g that is cached in a register to satisfy every reference to g. In this case,
the main function would never see the updated values from the handler.
You can tell the compiler not to cache a variable by declaring it with
the volatile type qualifier. For example:

    volatile int g;

The volatile qualifier forces the compiler to read the value of g from
memory each time it is referenced in the code. In general, as with any
shared data structure, each access to a global variable should be protected
by temporarily blocking signals.


G5. Declare flags with sig_atomic_t. In one common handler design, the
handler records the receipt of the signal by writing to a global flag. The
main program periodically reads the flag, responds to the signal, and
clears the flag. For flags that are shared in this way, C provides an integer
data type, sig_atomic_t, for which reads and writes are guaranteed to be
atomic (uninterruptible) because they can be implemented with a single
instruction:

    volatile sig_atomic_t flag;

Since they can't be interrupted, you can safely read from and write to
sig_atomic_t variables without temporarily blocking signals. Note that
the guarantee of atomicity only applies to individual reads and writes.
It does not apply to updates such as flag++ or flag = flag + 10, which
might require multiple instructions.


Keep in mind that the guidelines we have presented are conservative, in 
the sense that they are not always strictly necessary. For example, if you know 
that a handler can never modify errno, then you don’t need to save and restore 
errno. Or if you can prove that no instance of printf can ever be interrupted 
by a handler, then it is safe to call printf from the handler. The same holds for 
accesses to shared global data structures. However, it is very difficult to prove such 
assertions in general. So we recommend that you take the conservative approach 
and follow the guidelines by keeping your handlers as simple as possible, calling 
safe functions, saving and restoring errno, protecting accesses to shared data 
structures, and using volatile and sig_atomic_t. 


Correct Signal Handling 


One of the nonintuitive aspects of signals is that pending signals are not queued.
Because the pending bit vector contains exactly one bit for each type of signal,
there can be at most one pending signal of any particular type. Thus, if two signals
of type k are sent to a destination process while signal k is blocked because the
destination process is currently executing a handler for signal k, then the second 
signal is simply discarded; it is not queued. The key idea is that the existence of a 
pending signal merely indicates that at least one signal has arrived. 

To see how this affects correctness, let's look at a simple application that 
is similar in nature to real programs such as shells and Web servers. The basic 
structure is that a parent process creates some children that run independently for 
a while and then terminate. The parent must reap the children to avoid leaving 
zombies in the system. But we also want the parent to be free to do other work 
while the children are running. So we decide to reap the children with a SIGCHLD 
handler, instead of explicitly waiting for the children to terminate. (Recall that 
the kernel sends a SIGCHLD signal to the parent whenever one of its children 
terminates or stops.) 

Figure 8.36 shows our first attempt. The parent installs a SIGCHLD handler 
and then creates three children. In the meantime, the parent waits for a line of 
input from the terminal and then processes it. This processing is modeled by 
an infinite loop. When each child terminates, the kernel notifies the parent by 
sending it a SIGCHLD signal. The parent catches the SIGCHLD, reaps one child, 
code/ecf/signal1.c
/* WARNING: This code is buggy! */

void handler1(int sig)
{
    int olderrno = errno;

    if ((waitpid(-1, NULL, 0)) < 0)
        sio_error("waitpid error");
    Sio_puts("Handler reaped child\n");
    Sleep(1);
    errno = olderrno;
}

int main()
{
    int i, n;
    char buf[MAXBUF];

    if (signal(SIGCHLD, handler1) == SIG_ERR)
        unix_error("signal error");

    /* Parent creates children */
    for (i = 0; i < 3; i++) {
        if (Fork() == 0) {
            printf("Hello from child %d\n", (int)getpid());
            exit(0);
        }
    }

    /* Parent waits for terminal input and then processes it */
    if ((n = read(STDIN_FILENO, buf, sizeof(buf))) < 0)
        unix_error("read");

    printf("Parent processing input\n");
    while (1)
        ;

    exit(0);
}
code/ecf/signal1.c

Figure 8.36 signal1. This program is flawed because it assumes that signals are
queued.
does some additional cleanup work (modeled by the sleep statement), and then
returns.

The signal1 program in Figure 8.36 seems fairly straightforward. When we
run it on our Linux system, however, we get the following output:

linux> ./signal1
Hello from child 14073
Hello from child 14074
Hello from child 14075
Handler reaped child
Handler reaped child
CR
Parent processing input


From the output, we note that although three SIGCHLD signals were sent to the 
parent, only two of these signals were received, and thus the parent only reaped 
two children. If we suspend the parent process, we see that, indeed, child process 
14075 was never reaped and remains a zombie (indicated by the string <defunct> 
in the output of the ps command): 


Ctrl+Z
Suspended
linux> ps t
  PID TTY      STAT   TIME COMMAND
14072 pts/3    T      0:02 ./signal1
14075 pts/3    Z      0:00 [signal1] <defunct>
14076 pts/3    R+     0:00 ps t


What went wrong? The problem is that our code failed to account for the fact 
that signals are not queued. Here’s what happened: The first signal is received 
and caught by the parent. While the handler is still processing the first signal, the 
second signal is delivered and added to the set of pending signals. However, since 
SIGCHLD signals are blocked by the SIGCHLD handler, the second signal is not
received. Shortly thereafter, while the handler is still processing the first signal,
the third signal arrives. Since there is already a pending SIGCHLD, this third
SIGCHLD signal is discarded. Sometime later, after the handler has returned, 
the kernel notices that there is a pending SIGCHLD signal and forces the parent 
to receive the signal. The parent catches the signal and executes the handler a 
second time. After the handler finishes processing the second signal, there are no 
more pending SIGCHLD signals, and there never will be, because all knowledge 
of the third SIGCHLD has been lost. The crucial lesson is that signals cannot be 
used to count the occurrence of events in other processes. 

To fix the problem, we must recall that the existence of a pending signal only 
implies that at least one signal has been delivered since the last time the process 
received a signal of that type. So we must modify the SIGCHLD handler to reap 
code/ecf/signal2.c
void handler2(int sig)
{
    int olderrno = errno;

    while (waitpid(-1, NULL, 0) > 0) {
        Sio_puts("Handler reaped child\n");
    }
    if (errno != ECHILD)
        Sio_error("waitpid error");
    Sleep(1);
    errno = olderrno;
}
code/ecf/signal2.c

Figure 8.37 signal2. An improved version of Figure 8.36 that correctly accounts for
the fact that signals are not queued.


as many zombie children as possible each time it is invoked. Figure 8.37 shows the
modified SIGCHLD handler.

When we run signal2 on our Linux system, it now correctly reaps all of the
zombie children:

linux> ./signal2
Hello from child 15237
Hello from child 15238
Hello from child 15239
Handler reaped child
Handler reaped child
Handler reaped child
CR
Parent processing input
owing program?

code/ecf/signalprob0.c
volatile long counter = 2;

void handler1(int sig)
{
    sigset_t mask, prev_mask;

    Sigfillset(&mask);
    Sigprocmask(SIG_BLOCK, &mask, &prev_mask);  /* Block sigs */
    Sio_putl(--counter);
    Sigprocmask(SIG_SETMASK, &prev_mask, NULL); /* Restore sigs */

    _exit(0);
}

int main()
{
    pid_t pid;
    sigset_t mask, prev_mask;

    printf("%ld", counter);
    fflush(stdout);

    signal(SIGUSR1, handler1);
    if ((pid = Fork()) == 0) {
        while (1) {};
    }
    Kill(pid, SIGUSR1);
    Waitpid(-1, NULL, 0);

    Sigfillset(&mask);
    Sigprocmask(SIG_BLOCK, &mask, &prev_mask);  /* Block sigs */
    printf("%ld", ++counter);
    Sigprocmask(SIG_SETMASK, &prev_mask, NULL); /* Restore sigs */

    exit(0);
}
code/ecf/signalprob0.c


Portable Signal Handling 


Another ugly aspect of Unix signal handling is that different systems have different 
signal-handling semantics. For example: 


• The semantics of the signal function varies. Some older Unix systems restore
  the action for signal k to its default after signal k has been caught by a handler.
  On these systems, the handler must explicitly reinstall itself, by calling signal,
  each time it runs.

• System calls can be interrupted. System calls such as read, wait, and accept
  that can potentially block the process for a long period of time are called
  slow system calls. On some older versions of Unix, slow system calls that are
  interrupted when a handler catches a signal do not resume when the signal
  handler returns but instead return immediately to the user with an error
  condition and errno set to EINTR. On these systems, programmers must
  include code that manually restarts interrupted system calls.
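As an illustration of the second point, code that manually restarts an interrupted slow system call usually takes the form of a retry loop around the call. The following is a minimal sketch for read. The name read_restart is hypothetical, chosen only for this example.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Behaves like read, except that if the call is interrupted by a
 * signal handler (read returns -1 with errno set to EINTR), it is
 * simply issued again. Any other result is passed back to the caller.
 */
ssize_t read_restart(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

The same pattern applies to other slow system calls such as wait and accept. It is harmless on systems that restart system calls automatically, since the loop body then executes only once.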
code/src/csapp.c
handler_t *Signal(int signum, handler_t *handler)
{
    struct sigaction action, old_action;

    action.sa_handler = handler;
    sigemptyset(&action.sa_mask); /* Block sigs of type being handled */
    action.sa_flags = SA_RESTART; /* Restart syscalls if possible */

    if (sigaction(signum, &action, &old_action) < 0)
        unix_error("Signal error");
    return (old_action.sa_handler);
}
code/src/csapp.c

Figure 8.38 Signal. A wrapper for sigaction that provides portable signal handling on Posix-compliant
systems.


To deal with these issues, the Posix standard defines the sigaction function, which
allows users to clearly specify the signal-handling semantics they want when they
install a handler.


#include <signal.h>

int sigaction(int signum, struct sigaction *act,
              struct sigaction *oldact);

Returns: 0 if OK, -1 on error





The sigaction function is unwieldy because it requires the user to set the entries
of a complicated structure. A cleaner approach, originally proposed by W. Richard
Stevens [110], is to define a wrapper function, called Signal, that calls sigaction
for us. Figure 8.38 shows the definition of Signal, which is invoked in the same
way as the signal function.

The Signal wrapper installs a signal handler with the following signal-
handling semantics:

• Only signals of the type currently being processed by the handler are blocked.
• As with all signal implementations, signals are not queued.
• Interrupted system calls are automatically restarted whenever possible.
• Once the signal handler is installed, it remains installed until Signal is called
  with a handler argument of either SIG_IGN or SIG_DFL.


We will use the Signal wrapper in all of our code. 
8.5.6 Synchronizing Flows to Avoid Nasty Concurrency Bugs 


The problem of how to program concurrent flows that read and write the same 
storage locations has challenged generations of computer scientists. In general, 
the number of potential interleavings of the flows is exponential in the number of 
instructions. Some of those interleavings will produce correct answers, and others 
will not. The fundamental problem is to somehow synchronize the concurrent 
flows so as to allow the largest set of feasible interleavings such that each of the 
feasible interleavings produces a correct answer. 

Concurrent programming is a deep and important problem that we will discuss 
in more detail in Chapter 12. However, we can use what you've learned about 
exceptional control flow in this chapter to give you a sense of the interesting 
intellectual challenges associated with concurrency. For example, consider the 
program in Figure 8.39, which captures the structure of a typical Unix shell. The 
parent keeps track of its current children using entries in a global job list, with one 
entry per job. The addjob and deletejob functions add and remove entries from 
the job list. 

After the parent creates a new child process, it adds the child to the job 
list. When the parent reaps a terminated (zombie) child in the SIGCHLD signal 
handler, it deletes the child from the job list. 

At first glance, this code appears to be correct. Unfortunately, the following 
sequence of events is possible: 


1. The parent executes the fork function and the kernel schedules the newly
   created child to run instead of the parent.

2. Before the parent is able to run again, the child terminates and becomes a
   zombie, causing the kernel to deliver a SIGCHLD signal to the parent.

3. Later, when the parent becomes runnable again but before it is executed, the
   kernel notices the pending SIGCHLD and causes it to be received by running
   the signal handler in the parent.

4. The signal handler reaps the terminated child and calls deletejob, which does
   nothing because the parent has not added the child to the list yet.

5. After the handler completes, the kernel then runs the parent, which returns
   from fork and incorrectly adds the (nonexistent) child to the job list by calling
   addjob.


Thus, for some interleavings of the parent's main routine and signal-handling
flows, it is possible for deletejob to be called before addjob. This results in an
incorrect entry on the job list, for a job that no longer exists and that will never be
removed. On the other hand, there are also interleavings where events occur in
the correct order. For example, if the kernel happens to schedule the parent to run
when the fork call returns instead of the child, then the parent will correctly add
the child to the job list before the child terminates and the signal handler removes
the job from the list.

This is an example of a classic synchronization error known as a race. In this 
case, the race is between the call to addjob in the main routine and the call to 
code/ecf/procmask1.c
/* WARNING: This code is buggy! */
void handler(int sig)
{
    int olderrno = errno;
    sigset_t mask_all, prev_all;
    pid_t pid;

    Sigfillset(&mask_all);
    while ((pid = waitpid(-1, NULL, 0)) > 0) { /* Reap a zombie child */
        Sigprocmask(SIG_BLOCK, &mask_all, &prev_all);
        deletejob(pid); /* Delete the child from the job list */
        Sigprocmask(SIG_SETMASK, &prev_all, NULL);
    }
    if (errno != ECHILD)
        Sio_error("waitpid error");
    errno = olderrno;
}

int main(int argc, char **argv)
{
    int pid;
    sigset_t mask_all, prev_all;

    Sigfillset(&mask_all);
    Signal(SIGCHLD, handler);
    initjobs(); /* Initialize the job list */

    while (1) {
        if ((pid = Fork()) == 0) { /* Child process */
            Execve("/bin/date", argv, NULL);
        }
        Sigprocmask(SIG_BLOCK, &mask_all, &prev_all); /* Parent process */
        addjob(pid); /* Add the child to the job list */
        Sigprocmask(SIG_SETMASK, &prev_all, NULL);
    }
    exit(0);
}
code/ecf/procmask1.c

Figure 8.39 A shell program with a subtle synchronization error. If the child terminates before the parent
is able to run, then addjob and deletejob will be called in the wrong order.
deletejob in the handler. If addjob wins the race, then the answer is correct. If 
not, the answer is incorrect. Such errors are enormously difficult to debug because 
it is often impossible to test every interleaving. You might run the code a billion 
times without a problem, but then the next test results in an interleaving that 
triggers the race. 

Figure 8.40 shows one way to eliminate the race in Figure 8.39. By blocking 
SIGCHLD signals before the call to fork and then unblocking them only after we 
have called addjob, we guarantee that the child will be reaped after it is added to 
the job list. Notice that children inherit the blocked set of their parents, so we must 
be careful to unblock the SIGCHLD signal in the child before calling execve. 


8.5.7 Explicitly Waiting for Signals 


Sometimes a main program needs to explicitly wait for a certain signal handler to 
run. For example, when a Linux shell creates a foreground job, it must wait for 
the job to terminate and be reaped by the SIGCHLD handler before accepting 
the next user command. 

Figure 8.41 shows the basic idea. The parent installs handlers for SIGINT and 
SIGCHLD and then enters an infinite loop. It blocks SIGCHLD to avoid the race 
between parent and child that we discussed in Section 8.5.6. After creating the 
child, it resets pid to zero, unblocks SIGCHLD, and then waits in a spin loop for 
pid to become nonzero. After the child terminates, the handler reaps it and assigns 
its nonzero PID to the global pid variable. This terminates the spin loop, and the 
parent continues with additional work before starting the next iteration. 

While this code is correct, the spin loop is wasteful of processor resources. We 
might be tempted to fix this by inserting a pause in the body of the spin loop: 


    while (!pid) /* Race! */
        pause();


Notice that we still need a loop because pause might be interrupted by the 
receipt of one or more SIGINT signals. However, this code has a serious race 
condition: if the SIGCHLD is received after the while test but before the pause, 
the pause will sleep forever. 

Another option is to replace the pause with sleep: 


while (!pid) /* Too slow! */ 
sleep(1); 


While correct, this code is too slow. If the signal is received after the while 
and before the sleep, the program must wait a (relatively) long time before it 
can check the loop termination condition again. Using a higher-resolution sleep 
function such as nanosleep isn't acceptable, either, because there is no good rule 
for determining the sleep interval. Make it too small and the loop is too wasteful. 
Make it too high and the program is too slow. 
code/ecf/procmask2.c
void handler(int sig)
{
    int olderrno = errno;
    sigset_t mask_all, prev_all;
    pid_t pid;

    Sigfillset(&mask_all);
    while ((pid = waitpid(-1, NULL, 0)) > 0) { /* Reap a zombie child */
        Sigprocmask(SIG_BLOCK, &mask_all, &prev_all);
        deletejob(pid); /* Delete the child from the job list */
        Sigprocmask(SIG_SETMASK, &prev_all, NULL);
    }
    if (errno != ECHILD)
        Sio_error("waitpid error");
    errno = olderrno;
}

int main(int argc, char **argv)
{
    int pid;
    sigset_t mask_all, mask_one, prev_one;

    Sigfillset(&mask_all);
    Sigemptyset(&mask_one);
    Sigaddset(&mask_one, SIGCHLD);
    Signal(SIGCHLD, handler);
    initjobs(); /* Initialize the job list */

    while (1) {
        Sigprocmask(SIG_BLOCK, &mask_one, &prev_one); /* Block SIGCHLD */
        if ((pid = Fork()) == 0) { /* Child process */
            Sigprocmask(SIG_SETMASK, &prev_one, NULL); /* Unblock SIGCHLD */
            Execve("/bin/date", argv, NULL);
        }
        Sigprocmask(SIG_BLOCK, &mask_all, NULL); /* Parent process */
        addjob(pid); /* Add the child to the job list */
        Sigprocmask(SIG_SETMASK, &prev_one, NULL); /* Unblock SIGCHLD */
    }
    exit(0);
}
code/ecf/procmask2.c

Figure 8.40 Using sigprocmask to synchronize processes. In this example, the parent ensures that
addjob executes before the corresponding deletejob.
code/ecf/waitforsignal.c
#include "csapp.h"

volatile sig_atomic_t pid;

void sigchld_handler(int s)
{
    int olderrno = errno;
    pid = waitpid(-1, NULL, 0);
    errno = olderrno;
}

void sigint_handler(int s)
{
}

int main(int argc, char **argv)
{
    sigset_t mask, prev;

    Signal(SIGCHLD, sigchld_handler);
    Signal(SIGINT, sigint_handler);
    Sigemptyset(&mask);
    Sigaddset(&mask, SIGCHLD);

    while (1) {
        Sigprocmask(SIG_BLOCK, &mask, &prev); /* Block SIGCHLD */
        if (Fork() == 0) /* Child */
            exit(0);

        /* Parent */
        pid = 0;
        Sigprocmask(SIG_SETMASK, &prev, NULL); /* Unblock SIGCHLD */

        /* Wait for SIGCHLD to be received (wasteful) */
        while (!pid)
            ;

        /* Do some work after receiving SIGCHLD */
        printf(".");
    }
    exit(0);
}
code/ecf/waitforsignal.c

Figure 8.41 Waiting for a signal with a spin loop. This code is correct, but the spin loop is wasteful.
The proper solution is to use sigsuspend.


#include <signal.h> 


int sigsuspend(const sigset_t *mask); 


Returns: -1





The sigsuspend function temporarily replaces the current blocked set with mask 
and then suspends the process until the receipt of a signal whose action is either 
to run a handler or to terminate the process. If the action is to terminate, then the 
process terminates without returning from sigsuspend. If the action is to run a 
handler, then sigsuspend returns after the handler returns, restoring the blocked 
set to its state when sigsuspend was called. 

The sigsuspend function is equivalent to an atomic (uninterruptible) version 
of the following: 


1   sigprocmask(SIG_SETMASK, &mask, &prev);
2   pause();
3   sigprocmask(SIG_SETMASK, &prev, NULL);


The atomic property guarantees that the calls to sigprocmask (line 1) and pause 
(line 2) occur together, without being interrupted. This eliminates the potential 
race where a signal is received after the call to sigprocmask and before the call 
to pause. 

Figure 8.42 shows how we would use sigsuspend to replace the spin loop
in Figure 8.41. Before each call to sigsuspend, SIGCHLD is blocked. The
sigsuspend temporarily unblocks SIGCHLD, and then sleeps until the parent
catches a signal. Before returning, it restores the original blocked set, which blocks
SIGCHLD again. If the parent caught a SIGINT, then the loop test succeeds and
the next iteration calls sigsuspend again. If the parent caught a SIGCHLD, then
the loop test fails and we exit the loop. At this point, SIGCHLD is blocked, and
so we can optionally unblock SIGCHLD. This might be useful in a real shell with
background jobs that need to be reaped.

The sigsuspend version is less wasteful than the original spin loop, avoids the
race introduced by pause, and is more efficient than sleep.


8.6 Nonlocal Jumps 


C provides a form of user-level exceptional control flow, called a nonlocal jump,
that transfers control directly from one function to another currently executing
function without having to go through the normal call-and-return sequence.
Nonlocal jumps are provided by the setjmp and longjmp functions.
code/ecf/sigsuspend.c
#include "csapp.h"

volatile sig_atomic_t pid;

void sigchld_handler(int s)
{
    int olderrno = errno;
    pid = Waitpid(-1, NULL, 0);
    errno = olderrno;
}

void sigint_handler(int s)
{
}

int main(int argc, char **argv)
{
    sigset_t mask, prev;

    Signal(SIGCHLD, sigchld_handler);
    Signal(SIGINT, sigint_handler);
    Sigemptyset(&mask);
    Sigaddset(&mask, SIGCHLD);

    while (1) {
        Sigprocmask(SIG_BLOCK, &mask, &prev); /* Block SIGCHLD */
        if (Fork() == 0) /* Child */
            exit(0);

        /* Wait for SIGCHLD to be received */
        pid = 0;
        while (!pid)
            sigsuspend(&prev);

        /* Optionally unblock SIGCHLD */
        Sigprocmask(SIG_SETMASK, &prev, NULL);

        /* Do some work after receiving SIGCHLD */
        printf(".");
    }
    exit(0);
}
code/ecf/sigsuspend.c

Figure 8.42 Waiting for a signal with sigsuspend.
#include <setjmp.h>

int setjmp(jmp_buf env);
int sigsetjmp(sigjmp_buf env, int savesigs);

Returns: 0 from setjmp, nonzero from longjmps





The setjmp function saves the current calling environment in the env buffer, for
later use by longjmp, and returns 0. The calling environment includes the program
counter, stack pointer, and general-purpose registers. For subtle reasons beyond
our scope, the value that setjmp returns should not be assigned to a variable:


rc = setjmp(env); /* Wrong! */ 


However, it can be safely used as a test in a switch or conditional statement [62]. 


#include <setjmp.h> 


void longjmp(jmp_buf env, int retval); 


void siglongjmp(sigjmp_buf env, int retval);


Never returns 





The longjmp function restores the calling environment from the env buffer and
then triggers a return from the most recent setjmp call that initialized env. The
setjmp then returns with the nonzero return value retval.

The interactions between setjmp and longjmp can be confusing at first glance.
The setjmp function is called once but returns multiple times: once when the
setjmp is first called and the calling environment is stored in the env buffer,
and once for each corresponding longjmp call. On the other hand, the longjmp
function is called once but never returns.
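
The "called once, returns multiple times" behavior can be observed directly. The
following is a minimal sketch rather than code from the book's examples: the
helper name count_setjmp_returns and its jumps parameter are hypothetical. It
counts how often a single setjmp call site returns when jumps longjmp calls
target it. The counters are declared volatile because they are modified between
the setjmp and the longjmp.

```c
#include <setjmp.h>

static jmp_buf env;

/* Hypothetical helper: returns how many times the one setjmp call below
 * returned, given that `jumps` longjmp calls are made back to it. */
int count_setjmp_returns(int jumps)
{
    volatile int returns = 0;       /* survives the nonlocal jumps */
    volatile int remaining = jumps; /* longjmp calls still to make */

    if (setjmp(env) == 0) {
        /* Direct return: the environment was just saved */
    } else {
        /* Indirect return: we arrived here via longjmp */
    }
    returns++;

    if (remaining > 0) {
        remaining--;
        longjmp(env, 1);            /* never returns; setjmp returns 1 */
    }
    return returns;
}
```

Under these assumptions, a call with jumps equal to k reports k + 1 returns: one
for the direct call and one for each longjmp.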

An important application of nonlocal jumps is to permit an immediate return
from a deeply nested function call, usually as a result of detecting some error
condition. If an error condition is detected deep in a nested function call, we can
use a nonlocal jump to return directly to a common localized error handler instead
of laboriously unwinding the call stack.

Figure 8.43 shows an example of how this might work. The main routine first
calls setjmp to save the current calling environment, and then calls function foo,
which in turn calls function bar. If foo or bar encounters an error, it returns
immediately from the setjmp via a longjmp call. The nonzero return value of the
setjmp indicates the error type, which can then be decoded and handled in one
place in the code.

The feature of longjmp that allows it to skip up through all intermediate calls
can have unintended consequences. For example, if some data structures were
allocated in the intermediate function calls with the intention to deallocate them
at the end of the function, the deallocation code gets skipped, thus creating a
memory leak.
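
One way to avoid this kind of leak is to release whatever the current function
owns before it jumps. The sketch below illustrates the idea under stated
assumptions: the names deep_work, run_with_cleanup, and outstanding_buffers are
hypothetical, and the live_buffers counter exists only so the effect can be
checked. It is one possible discipline, not the only one.

```c
#include <setjmp.h>
#include <stdlib.h>

static jmp_buf err_env;
static int live_buffers = 0;   /* illustration only: allocations not yet freed */

/* Hypothetical intermediate function that owns a temporary buffer. */
static void deep_work(int fail)
{
    char *scratch = malloc(64);

    if (scratch == NULL)
        longjmp(err_env, 2);   /* nothing to free yet */
    live_buffers++;

    if (fail) {
        free(scratch);         /* free first: the code below is skipped */
        live_buffers--;
        longjmp(err_env, 1);
    }

    free(scratch);             /* normal path */
    live_buffers--;
}

/* Returns 0 on success, -1 if deep_work bailed out with longjmp. */
int run_with_cleanup(int fail)
{
    switch (setjmp(err_env)) {
    case 0:
        deep_work(fail);
        return 0;
    default:
        return -1;
    }
}

int outstanding_buffers(void)
{
    return live_buffers;
}
```

Note that this only covers memory owned by the function that calls longjmp.
Buffers held by other functions between it and the setjmp would still leak.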
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code/ecf/setjmp.c 
#include "csapp.h"

jmp_buf buf;

int error1 = 0;
int error2 = 1;

void foo(void), bar(void);

int main()
{
    switch(setjmp(buf)) {
    case 0:
        foo();
        break;
    case 1:
        printf("Detected an error1 condition in foo\n");
        break;
    case 2:
        printf("Detected an error2 condition in foo\n");
        break;
    default:
        printf("Unknown error condition in foo\n");
    }
    exit(0);
}

/* Deeply nested function foo */
void foo(void)
{
    if (error1)
        longjmp(buf, 1);
    bar();
}

void bar(void)
{
    if (error2)
        longjmp(buf, 2);
}
code/ecf/setjmp.c


Figure 8.43 Nonlocal jump example. This example shows the framework for using
nonlocal jumps to recover from error conditions in deeply nested functions without
having to unwind the entire stack.


code/ecf/restart.c
#include "csapp.h"

sigjmp_buf buf;

void handler(int sig)
{
    siglongjmp(buf, 1);
}

int main()
{
    if (!sigsetjmp(buf, 1)) {
        Signal(SIGINT, handler);
        Sio_puts("starting\n");
    }
    else
        Sio_puts("restarting\n");

    while(1) {
        Sleep(1);
        Sio_puts("processing...\n");
    }
    exit(0); /* Control never reaches here */
}
code/ecf/restart.c


Figure 8.44 A program that uses nonlocal jumps to restart itself when the user
types Ctrl+C.


Another important application of nonlocal jumps is to branch out of a signal
handler to a specific code location, rather than returning to the instruction that was
interrupted by the arrival of the signal. Figure 8.44 shows a simple program that
illustrates this basic technique. The program uses signals and nonlocal jumps to
do a soft restart whenever the user types Ctrl+C at the keyboard. The sigsetjmp
and siglongjmp functions are versions of setjmp and longjmp that can be used
by signal handlers.

The initial call to the sigsetjmp function saves the calling environment and
signal context (including the pending and blocked signal vectors) when the
program first starts. The main routine then enters an infinite processing loop. When
the user types Ctrl+C, the kernel sends a SIGINT signal to the process, which
catches it. Instead of returning from the signal handler, which would pass control
back to the interrupted processing loop, the handler performs a nonlocal jump
back to the beginning of the main program. When we run the program on our
system, we get the following output:


linux> ./restart
starting
processing...
processing...
Ctrl+C
restarting
processing...
Ctrl+C
restarting
processing...


Aside  Software exceptions in C++ and Java

The exception mechanisms provided by C++ and Java are higher-level, more structured versions of the
C setjmp and longjmp functions. You can think of a catch clause inside a try statement as being akin
to a setjmp function. Similarly, a throw statement is similar to a longjmp function.


There are a couple of interesting things about this program. First, to avoid a race,
we must install the handler after we call sigsetjmp. If not, we would run the
risk of the handler running before the initial call to sigsetjmp sets up the calling
environment for siglongjmp. Second, you might have noticed that the sigsetjmp
and siglongjmp functions are not on the list of async-signal-safe functions in
Figure 8.33. The reason is that in general siglongjmp can jump into arbitrary
code, so we must be careful to call only safe functions in any code reachable from
a siglongjmp. In our example, we call the safe sio_puts and sleep functions.
The unsafe exit function is unreachable.


8.7 Tools for Manipulating Processes 


Linux systems provide a number of useful tools for monitoring and manipulating 
processes: 


strace. Prints a trace of each system call invoked by a running program and
its children. It is a fascinating tool for the curious student. Compile your
program with -static to get a cleaner trace without a lot of output related
to shared libraries.


ps. Lists processes (including zombies) currently in the system.

top. Prints information about the resource usage of current processes.

pmap. Displays the memory map of a process.


/proc. A virtual filesystem that exports the contents of numerous kernel data
structures in an ASCII text form that can be read by user programs. For
example, type cat /proc/loadavg to see the current load average on
your Linux system.


8.8 Summary


Exceptional control flow (ECF) occurs at all levels of a computer system and is a
basic mechanism for providing concurrency in a computer system.

At the hardware level, exceptions are abrupt changes in the control flow that 
are triggered by events in the processor. The control flow passes to a software 
handler, which does some processing and then returns control to the interrupted 
control flow. 

There are four different types of exceptions: interrupts, faults, aborts, and
traps. Interrupts occur asynchronously (with respect to any instructions) when
an external I/O device such as a timer chip or a disk controller sets the
interrupt pin on the processor chip. Control returns to the instruction
following the faulting instruction. Faults and aborts occur synchronously as the
result of the execution of an instruction. Fault handlers restart the faulting
instruction, while abort handlers never return control to the interrupted flow.
Finally, traps are like function calls that are used to implement the system calls
that provide applications with controlled entry points into the operating
system code.

At the operating system level, the kernel uses ECF to provide the funda- 
mental notion of a process. A process provides applications with two important 
abstractions: (1) logical control flows that give each program the illusion that it 
has exclusive use of the processor, and (2) private address spaces that provide the 
illusion that each program has exclusive use of the main memory. 

At the interface between the operating system and applications, applications
can create child processes, wait for their child processes to stop or terminate, run 
new programs, and catch signals from other processes. The semantics of signal 
handling is subtle and can vary from system to system. However, mechanisms exist 
on Posix-compliant systems that allow programs to clearly specify the expected 
signal-handling semantics. 

Finally, at the application level, C programs can use nonlocal jumps to bypass 
the normal call/return stack discipline and branch directly from one function to 
another.


Homework Problems


8.9 ◆
Consider four processes with the following starting and ending times:


Process    Start time    End time
A          5             7
B          2
C          3             6
D          1

For each pair of processes, indicate whether they run concurrently (Y) or
not (N):


Process pair    Concurrent?
AB
AC
AD
BC
BD
CD


8.10 ◆
In this chapter, we have introduced some functions with unusual call and return
behaviors: setjmp, longjmp, execve, and fork. Match each function with one of
the following behaviors:

A. Called once, returns twice

B. Called once, never returns

C. Called once, returns one or more times
8.11 ◆
How many “hello” output lines does this program print?


code/ecf/forkprob1.c
#include "csapp.h"

int main()
{
    int i;

    for (i = 0; i < 2; i++)
        Fork();
    printf("hello\n");
    exit(0);
}
code/ecf/forkprob1.c


8.12 ◆
How many “hello” output lines does this program print?


code/ecf/forkprob4.c
#include "csapp.h"

void doit()
{
    Fork();
    Fork();
    printf("hello\n");
    return;
}

int main()
{
    doit();
    printf("hello\n");
    exit(0);
}
code/ecf/forkprob4.c


8.13 ◆
What is one possible output of the following program?


code/ecf/forkprob3.c
#include "csapp.h"

int main()
{
    int x = 3;

    if (Fork() != 0)
        printf("x=%d\n", ++x);

    printf("x=%d\n", --x);
    exit(0);
}
code/ecf/forkprob3.c


8.14 ◆
How many “hello” output lines does this program print?


code/ecf/forkprob5.c
#include "csapp.h"

void doit()
{
    if (Fork() == 0) {
        Fork();
        printf("hello\n");
        exit(0);
    }
    return;
}

int main()
{
    doit();
    printf("hello\n");
    exit(0);
}
code/ecf/forkprob5.c


8.15 ◆
How many “hello” lines does this program print?


code/ecf/forkprob6.c
#include "csapp.h"

void doit()
{
    if (Fork() == 0) {
        Fork();
        printf("hello\n");
        return;
    }
    return;
}

int main()
{
    doit();
    printf("hello\n");
    exit(0);
}
code/ecf/forkprob6.c


8.16 ◆
What is the output of the following program?


code/ecf/forkprob7.c
#include "csapp.h"
int counter = 1;

int main()
{
    if (fork() == 0) {
        counter--;
        exit(0);
    }
    else {
        Wait(NULL);
        printf("counter = %d\n", ++counter);
    }
    exit(0);
}
code/ecf/forkprob7.c


8.17 ◆
Enumerate all of the possible outputs of the program in Practice Problem 8.4.


8.18 ◆◆
Consider the following program:


code/ecf/forkprob2.c
#include "csapp.h"

void end(void)
{
    printf("2"); fflush(stdout);
}

int main()
{
    if (Fork() == 0)
        atexit(end);
    if (Fork() == 0) {
        printf("0"); fflush(stdout);
    }
    else {
        printf("1"); fflush(stdout);
    }
    exit(0);
}
code/ecf/forkprob2.c


Determine which of the following outputs are possible. Note: The atexit
function takes a pointer to a function and adds it to a list of functions (initially
empty) that will be called when the exit function is called.

A. 112002
B. 211020
C. 102120
D. 122001
E. 100212

8.19 ◆◆
How many lines of output does the following function print? Give your answer as
a function of n. Assume n > 1.


code/ecf/forkprob8.c
void foo(int n)
{
    int i;

    for (i = 0; i < n; i++)
        Fork();
    printf("hello\n");
    exit(0);
}
code/ecf/forkprob8.c


8.20 ◆◆
Use execve to write a program called myls whose behavior is identical to the
/bin/ls program. Your program should accept the same command-line
arguments, interpret the identical environment variables, and produce the identical
output.

The ls program gets the width of the screen from the COLUMNS
environment variable. If COLUMNS is unset, then ls assumes that the screen is 80
columns wide. Thus, you can check your handling of the environment variables
by setting the COLUMNS environment to something less than 80:


linux> setenv COLUMNS 40
linux> ./myls

// Output is 40 columns wide

linux> unsetenv COLUMNS
linux> ./myls

// Output is now 80 columns wide


8.21 ◆◆
What are the possible output sequences from the following program?


code/ecf/waitprob3.c
int main()
{
    if (fork() == 0) {
        printf("a"); fflush(stdout);
        exit(0);
    }
    else {
        printf("b"); fflush(stdout);
        waitpid(-1, NULL, 0);
    }
    printf("c"); fflush(stdout);
    exit(0);
}
code/ecf/waitprob3.c


8.22 ◆◆◆
Write your own version of the Unix system function


int mysystem(char *command);


The mysystem function executes command by invoking /bin/sh -c command, and
then returns after command has completed. If command exits normally (by calling
the exit function or executing a return statement), then mysystem returns the
command exit status. For example, if command terminates by calling exit(8), then
mysystem returns the value 8. Otherwise, if command terminates abnormally, then
mysystem returns the status returned by the shell.


8.23 ◆◆
One of your colleagues is thinking of using signals to allow a parent process to
count events that occur in a child process. The idea is to notify the parent each
time an event occurs by sending it a signal and letting the parent's signal handler
increment a global counter variable, which the parent can then inspect after the
child has terminated. However, when he runs the test program in Figure 8.45 on
his system, he discovers that when the parent calls printf, counter always has a
value of 2, even though the child has sent five signals to the parent. Perplexed, he
comes to you for help. Can you explain the bug?
t 


8.24 ◆◆◆
Modify the program in Figure 8.18 so that the following two conditions are met:


1. Each child terminates abnormally after attempting to write to a location in
the read-only text segment.

2. The parent prints output that is identical (except for the PIDs) to the
following:


child 12255 terminated by signal 11: Segmentation fault
child 12254 terminated by signal 11: Segmentation fault


Hint: Read the man page for psignal(3).


code/ecf/counterprob.c
#include "csapp.h"

int counter = 0;

void handler(int sig)
{
    counter++;
    sleep(1); /* Do some work in the handler */
    return;
}

int main()
{
    int i;

    Signal(SIGUSR2, handler);

    if (Fork() == 0) { /* Child */
        for (i = 0; i < 5; i++) {
            Kill(getppid(), SIGUSR2);
            printf("sent SIGUSR2 to parent\n");
        }
        exit(0);
    }

    Wait(NULL);
    printf("counter=%d\n", counter);
    exit(0);
}
code/ecf/counterprob.c


Figure 8.45 Counter program referenced in Problem 8.23.


8.25 ◆◆◆
Write a version of the fgets function, called tfgets, that times out after 5 seconds.
The tfgets function accepts the same inputs as fgets. If the user doesn’t type an
input line within 5 seconds, tfgets returns NULL. Otherwise, it returns a pointer
to the input line.


8.26 ◆◆◆
Using the example in Figure 8.23 as a starting point, write a shell program that
supports job control. Your shell should have the following features:


* The command line typed by the user consists of a name and zero or more
arguments, all separated by one or more spaces. If name is a built-in command, the
shell handles it immediately and waits for the next command line. Otherwise,
the shell assumes that name is an executable file, which it loads and runs in the
context of an initial child process (job). The process group ID for the job is
identical to the PID of the child.


* Each job is identified by either a process ID (PID) or a job ID (JID), which
is a small arbitrary positive integer assigned by the shell. JIDs are denoted on
the command line by the prefix ‘%’. For example, ‘%5’ denotes JID 5, and ‘5’
denotes PID 5.


* If the command line ends with an ampersand, then the shell runs the job in
the background. Otherwise, the shell runs the job in the foreground.


* Typing Ctrl+C (Ctrl+Z) causes the kernel to send a SIGINT (SIGTSTP) signal
to your shell, which then forwards it to every process in the foreground process
group.²

* The jobs built-in command lists all background jobs.


* The bg job built-in command restarts job by sending it a SIGCONT signal
and then runs it in the background. The job argument can be either a PID or
a JID.


* The fg job built-in command restarts job by sending it a SIGCONT signal and
then runs it in the foreground.


* The shell reaps all of its zombie children. If any job terminates because it
receives a signal that was not caught, then the shell prints a message to the
terminal with the job's PID and a description of the offending signal.


Figure 8.46 shows an example shell session.


2. Note that this is a simplification of the way that real shells work. With real shells, the kernel responds
to Ctrl+C (Ctrl+Z) by sending SIGINT (SIGTSTP) directly to each process in the terminal foreground
process group. The shell manages the membership of this group using the tcsetpgrp function, and
manages the attributes of the terminal using the tcsetattr function, both of which are outside the
scope of this book. See [62] for details.


linux> ./shell                                Run your shell program
>bogus
bogus: Command not found.                     Execve can't find executable
>foo 10
Job 5035 terminated by signal: Interrupt      User types Ctrl+C
>foo 100 &
[1] 5036 foo 100 &
>foo 200 &
[2] 5037 foo 200 &
>jobs
[1] 5036 Running foo 100 &
[2] 5037 Running foo 200 &
>fg %1
Job [1] 5036 stopped by signal: Stopped       User types Ctrl+Z
>jobs
[1] 5036 Stopped foo 100 &
[2] 5037 Running foo 200 &
>bg 5035
5035: No such process
>bg 5036
[1] 5036 foo 100 &
>/bin/kill 5036
Job 5036 terminated by signal: Terminated
>fg %2                                        Wait for fg job to finish
>quit
linux>                                        Back to the Unix shell


Figure 8.46 Sample shell session for Problem 8.26.


Solutions to Practice Problems


Solution to Problem 8.1 (page 734)

Processes A and B are concurrent with respect to each other, as are B and C,
because their respective executions overlap—that is, one process starts before the
other finishes. Processes A and C are not concurrent because their executions do
not overlap; A finishes before C begins.


Solution to Problem 8.2 (page 743)

In our example program in Figure 8.15, the parent and child execute disjoint sets
of instructions. However, in this program, the parent and child execute nondisjoint
sets of instructions, which is possible because the parent and child have identical
code segments. This can be a difficult conceptual hurdle, so be sure you understand
the solution to this problem. Figure 8.47 shows the process graph.


Figure 8.47 Process graph for Practice Problem 8.2. (Child: printf "p1: x=2",
printf "p2: x=1", exit. Parent: main, fork, printf "p2: x=0", exit.)


A. The key idea here is that the child executes both printf statements. After
the fork returns, it executes the printf in line 6. Then it falls out of the if
statement and executes the printf in line 7. Here is the output produced by
the child:


p1: x=2
p2: x=1


B. The parent executes only the printf in line 7:


p2: x=0


Solution to Problem 8.3 (page 745)

We know that the sequences acbc, abcc, and bacc are possible because they
correspond to topological sorts of the process graph (Figure 8.48). However,
sequences such as bcac and cbca do not correspond to any topological sort and
thus are not feasible.


Figure 8.48 Process graph for Practice Problem 8.3.


Solution to Problem 8.4 (page 748)

A. We can determine the number of lines of output by simply counting the
number of printf vertices in the process graph (Figure 8.49). In this case,
there are six such vertices, and thus the program will print six lines of output.


B. Any output sequence corresponding to a topological sort of the graph is
possible. For example: Hello, 1, 0, Bye, 2, Bye is possible.


Figure 8.49 Process graph for Practice Problem 8.4.


Solution to Problem 8.5 (page 750)

code/ecf/snooze.c
unsigned int snooze(unsigned int secs) {
    unsigned int rc = sleep(secs);

    printf("Slept for %d of %d secs.\n", secs-rc, secs);
    return rc;
}
code/ecf/snooze.c


Solution to Problem 8.6 (page 752)

#include "csapp.h"

int main(int argc, char *argv[], char *envp[])
{
    int i;

    printf("Command-line arguments:\n");
    for (i=0; argv[i] != NULL; i++)
        printf("    argv[%2d]: %s\n", i, argv[i]);

    printf("\n");
    printf("Environment variables:\n");
    for (i=0; envp[i] != NULL; i++)
        printf("    envp[%2d]: %s\n", i, envp[i]);

    exit(0);
}


Solution to Problem 8.7 (page 764)

The sleep function returns prematurely whenever the sleeping process receives a
signal that is not ignored. But since the default action upon receipt of a SIGINT is
to terminate the process (Figure 8.26), we must install a SIGINT handler to allow
the sleep function to return. The handler simply catches the signal and returns
control to the sleep function, which returns immediately.


code/ecf/snooze.c
#include "csapp.h"

/* SIGINT handler */
void handler(int sig)
{
    return; /* Catch the signal and return */
}

unsigned int snooze(unsigned int secs) {
    unsigned int rc = sleep(secs);

    printf("Slept for %d of %d secs.\n", secs-rc, secs);
    return rc;
}

int main(int argc, char **argv) {

    if (argc != 2) {
        fprintf(stderr, "usage: %s <secs>\n", argv[0]);
        exit(0);
    }

    if (signal(SIGINT, handler) == SIG_ERR) /* Install SIGINT */
        unix_error("signal error\n");       /* handler */
    (void)snooze(atoi(argv[1]));
    exit(0);
}
code/ecf/snooze.c


Solution to Problem 8.8 (page 773) 


This program prints the string 213, which is the shorthand name of the CS:APP 
course at Carnegie Mellon. The parent starts by printing ‘2’, then forks the child, 
which spins in an infinite loop. The parent then sends a signal to the child and 
waits for it to terminate. The child catches the signal (interrupting the infinite 
loop), decrements the counter (from an initial value of 2), prints ‘1’, and then 
terminates. After the parent reaps the child, it increments the counter (from an 
initial value of 2), prints ‘3’, and terminates.


Processes in a system share the CPU and main memory with other processes.

However, sharing the main memory poses some special challenges. As demand 
on the CPU increases, processes slow down in some reasonably smooth way. But 
if too many processes need too much memory, then some of them will simply 
not be able to run. When a program is out of space, it is out of luck. Memory is 
also vulnerable to corruption. If some process inadvertently writes to the memory 
used by another process, that process might fail in some bewildering fashion totally 
unrelated to the program logic.

In order to manage memory more efficiently and with fewer errors, modern 
systems provide an abstraction of main memory known as virtual memory (VM). 
Virtual memory is an elegant interaction of hardware exceptions, hardware ad- 
dress translation, main memory, disk files, and kernel software that provides each 
process with a large, uniform, and private address space. With one clean mech- 
anism, virtual memory provides three important capabilities: (1) It uses main 
memory efficiently by treating it as a cache for an address space stored on disk, 
keeping only the active areas in main memory and transferring data back and 
forth between disk and memory as needed. (2) It simplifies memory management 
by providing each process with a uniform address space. (3) It protects the address 
space of each process from corruption by other processes. 

Virtual memory is one of the great ideas in computer systems. A major reason
for its success is that it works silently and automatically, without any intervention 
from the application programmer. Since virtual memory works so well behind the 
scenes, why would a programmer need to understand it? There are several reasons. 


* Virtual memory is central. Virtual memory pervades all levels of computer
systems, playing key roles in the design of hardware exceptions, assemblers,
linkers, loaders, shared objects, files, and processes. Understanding virtual
memory will help you better understand how systems work in general.


* Virtual memory is powerful. Virtual memory gives applications powerful
capabilities to create and destroy chunks of memory, map chunks of memory to
portions of disk files, and share memory with other processes. For example,
did you know that you can read or modify the contents of a disk file by reading
and writing memory locations? Or that you can load the contents of a file into
memory without doing any explicit copying? Understanding virtual memory
will help you harness its powerful capabilities in your applications.


* Virtual memory is dangerous. Applications interact with virtual memory
every time they reference a variable, dereference a pointer, or make a call to a
dynamic allocation package such as malloc. If virtual memory is used
improperly, applications can suffer from perplexing and insidious memory-related
bugs. For example, a program with a bad pointer can crash immediately with
a “segmentation fault” or a “protection fault,” run silently for hours before
crashing, or scariest of all, run to completion with incorrect results.
Understanding virtual memory, and the allocation packages such as malloc that
manage it, can help you avoid these errors.


This chapter looks at virtual memory from two angles. The first half of the
chapter describes how virtual memory works. The second half describes how 
virtual memory is used and managed by applications. There is no avoiding the 
fact that VM is complicated, and the discussion reflects this in places. The good 
news is that if you work through the details, you will be able to simulate the virtual 
memory mechanism of a small system by hand, and the virtual memory idea will 
be forever demystified. 

The second half builds on this understanding, showing you how to use and 
manage virtual memory in your programs. You will learn how to manage virtual 
memory via explicit memory mapping and calls to dynamic storage allocators such 
as the malloc package. You will also learn about a host of common memory- 
related errors in C programs and how to avoid them. 


9.1 Physical and Virtual Addressing 


The main memory of a computer system is organized as an array of M contiguous
byte-size cells. Each byte has a unique physical address (PA). The first byte has
an address of 0, the next byte an address of 1, the next byte an address of 2,
and so on. Given this simple organization, the most natural way for a CPU to
access memory would be to use physical addresses. We call this approach physical
addressing. Figure 9.1 shows an example of physical addressing in the context of
a load instruction that reads the 4-byte word starting at physical address 4. When
the CPU executes the load instruction, it generates an effective physical address
and passes it to main memory over the memory bus. The main memory fetches the
4-byte word starting at physical address 4 and returns it to the CPU, which stores
it in a register.

Early PCs used physical addressing, and systems such as digital signal
processors, embedded microcontrollers, and Cray supercomputers continue to do so.
However, modern processors use a form of addressing known as virtual
addressing, as shown in Figure 9.2.


Figure 9.1 A system that uses physical addressing. (Labels: physical address
(PA), main memory, data word.)


Figure 9.2 A system that uses virtual addressing. (Labels: CPU chip, virtual
address (VA) 4100, address translation, physical address, main memory, data word.)


With virtual addressing, the CPU accesses main memory by generating a
virtual address (VA), which is converted to the appropriate physical address before
being sent to main memory. The task of converting a virtual address to a physical
one is known as address translation. Like exception handling, address translation
requires close cooperation between the CPU hardware and the operating
system. Dedicated hardware on the CPU chip called the memory management unit
(MMU) translates virtual addresses on the fly, using a lookup table stored in main
memory whose contents are managed by the operating system.


9.2 Address Spaces 


An address space is an ordered set of nonnegative integer addresses

{0, 1, 2, ...}


If the integers in the address space are consecutive, then we say that it is a linear
address space. To simplify our discussion, we will always assume linear address
spaces. In a system with virtual memory, the CPU generates virtual addresses from
an address space of N = 2^n addresses called the virtual address space:


{0, 1, 2, ..., N - 1}


The size of an address space is characterized by the number of bits that are
needed to represent the largest address. For example, a virtual address space
with N = 2^n addresses is called an n-bit address space. Modern systems typically
support either 32-bit or 64-bit virtual address spaces.
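
The relationship between N and n can be checked with a few lines of C. This is
a minimal sketch under stated assumptions: the function name address_bits is
hypothetical, and because the count N is held in a 64-bit integer, the sketch
cannot represent a full 64-bit address space (N = 2^64).

```c
#include <stdint.h>

/* Hypothetical helper: given the number of addresses N in a linear address
 * space, returns n such that N = 2^n, or -1 if N is not a power of 2. */
int address_bits(uint64_t num_addresses)
{
    int n = 0;

    /* A power of 2 has exactly one bit set */
    if (num_addresses == 0 || (num_addresses & (num_addresses - 1)) != 0)
        return -1;

    while ((num_addresses >>= 1) != 0)
        n++;
    return n;
}
```

For example, an address space of 2^32 addresses, {0, 1, ..., 2^32 - 1}, is a
32-bit address space.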

A system also has a physical address space that corresponds to the M bytes of
physical memory in the system:


{0, 1, 2, ..., M - 1}


M is not required to be a power of 2, but to simplify the discussion, we will assume
that M = 2^m.

The concept of an address space is important because it makes a clean
distinction between data objects (bytes) and their attributes (addresses). Once we
recognize this distinction, then we can generalize and allow each data object to
have multiple independent addresses, each chosen from a different address space.
This is the basic idea of virtual memory. Each byte of main memory has a virtual
address chosen from the virtual address space, and a physical address chosen from
the physical address space.


Complete the following table, filling in the missing entries and replacing each
9.3 VM as a Tool for Caching


Conceptually, a virtual memory is organized as an array of N contiguous byte-size
cells stored on disk. Each byte has a unique virtual address that serves as an index
into the array. The contents of the array on disk are cached in main memory. As
with any other cache in the memory hierarchy, the data on disk (the lower level)
is partitioned into blocks that serve as the transfer units between the disk and
the main memory (the upper level). VM systems handle this by partitioning the
virtual memory into fixed-size blocks called virtual pages (VPs). Each virtual page
is P = 2^p bytes in size. Similarly, physical memory is partitioned into physical pages
(PPs), also P bytes in size. (Physical pages are also referred to as page frames.)

At any point in time, the set of virtual pages is partitioned into three disjoint
subsets: 


Unallocated. Pages that have not yet been allocated (or created) by the VM 
system. Unallocated blocks do not have any data associated with them, 
and thus do not occupy any space on disk. 


Cached. Allocated pages that are currently cached in physical memory. 





Uncached. Allocated pages that are not cached in physical memory. 


The example in Figure 9.3 shows a small virtual memory with eight virtual
pages. Virtual pages 0 and 3 have not been allocated yet, and thus do not yet exist
on disk. Virtual pages 1, 4, and 6 are cached in physical memory. Pages 2, 5, and 7
are allocated but are not currently cached in physical memory.

Figure 9.3 How a VM system uses main memory as a cache. Virtual pages (VPs) are
stored on disk; physical pages (PPs) are cached in DRAM.


9.3.1 DRAM Cache Organization 


To help us keep the different caches in the memory hierarchy straight, we will use 
the term SRAM cache to denote the L1, L2, and L3 cache memories between the 
CPU and main memory, and the term DRAM cache to denote the VM system’s 
cache that caches virtual pages in main memory. 

The position of the DRAM cache in the memory hierarchy has a big impact 
on the way that it is organized. Recall that a DRAM is at least 10 times slower 
than an SRAM and that disk is about 100,000 times slower than a DRAM. Thus, 
misses in DRAM caches are very expensive compared to misses in SRAM caches 
because DRAM cache misses are served from disk, while SRAM cache misses are 
usually served from DRAM-based main memory. Further, reading the first byte
from a disk sector is about 100,000 times slower than reading successive
bytes in the sector. The bottom line is that the organization of the DRAM cache
is driven entirely by the enormous cost of misses. 

Because of the large miss penalty and the expense of accessing the first byte, 
virtual pages tend to be large, typically 4 KB to 2 MB. Due to the large miss
penalty, DRAM caches are fully associative; that is, any virtual page can be placed 
in any physical page. The replacement policy on misses also assumes greater 
importance, because the penalty associated with replacing the wrong virtual page 
is so high. Thus, operating systems use much more sophisticated replacement 
algorithms for DRAM caches than the hardware does for SRAM caches. (These 
replacement algorithms are beyond our scope here.) Finally, because of the large 
access time of disk, DRAM caches always use write-back instead of write-through. 


9.3.2 Page Tables


As with any cache, the VM system must have some way to determine if a virtual
page is cached somewhere in DRAM. If so, the system must determine which
physical page it is cached in. If there is a miss, the system must determine
where the virtual page is stored on disk, select a victim page in physical memory,
and copy the virtual page from disk to DRAM, replacing the victim page.

Figure 9.4 Page table. Each entry of the memory-resident page table (in DRAM)
holds a valid bit and a physical page number or disk address.

These capabilities are provided by a combination of operating system software,
address translation hardware in the MMU (memory management unit), and
a data structure stored in physical memory known as a page table that maps
virtual pages to physical pages. The address translation hardware reads the page table
each time it converts a virtual address to a physical address. The operating system 
is responsible for maintaining the contents of the page table and transferring pages 
back and forth between disk and DRAM. 

Figure 9.4 shows the basic organization of a page table. A page table is an array
of page table entries (PTEs). Each page in the virtual address space has a PTE at
a fixed offset in the page table. For our purposes, we will assume that each PTE 
consists of a valid bit and an n-bit address field. The valid bit indicates whether 
the virtual page is currently cached in DRAM. If the valid bit is set, the address 
field indicates the start of the corresponding physical page in DRAM where the 
virtual page is cached. If the valid bit is not set, then a null address indicates that 
the virtual page has not yet been allocated. Otherwise, the address points to the 
start of the virtual page on disk. 
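
The three cases just described can be sketched in C. This is only an illustration under stated assumptions: the pte_t layout, the field names, and the convention that a zero address field means "null" are ours, not the format used by any particular MMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PTE for illustration: a valid bit plus an address field. */
typedef struct {
    bool     valid;    /* set if the virtual page is cached in DRAM */
    uint64_t address;  /* valid: start of the physical page;
                          not valid: disk address, or 0 (null) if unallocated */
} pte_t;

typedef enum { PAGE_UNALLOCATED, PAGE_CACHED, PAGE_UNCACHED } page_state_t;

/* Classify a virtual page by inspecting its PTE. */
page_state_t pte_state(const pte_t *pte)
{
    assert(pte != NULL);
    if (pte->valid)
        return PAGE_CACHED;              /* address field names a physical page */
    if (pte->address == 0)
        return PAGE_UNALLOCATED;         /* null address: page does not exist */
    return PAGE_UNCACHED;                /* address field names a disk location */
}
```

Because the valid bit is tested first, a cached page that happens to live at physical address 0 is still classified correctly. The sketch does assume that disk address 0 is never used for a real page.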

The example in Figure 9.4 shows a page table for a system with eight virtual
pages and four physical pages. Four virtual pages (VP 1, VP 2, VP 4, and VP 7)
are currently cached in DRAM. Two pages (VP 0 and VP 5) have not yet been
allocated, and the rest (VP 3 and VP 6) have been allocated but are not currently
cached. An important point to notice about Figure 9.4 is that because the DRAM
cache is fully associative, any physical page can contain any virtual page. 


Determine the number of page table entries (PTEs) that are needed for the
following combinations of virtual address size (n) and page size (P):





Number of PTEs 


9.3.3 Page Hits 


Consider what happens when the CPU reads a word of virtual memory contained 
in VP 2, which is cached in DRAM (Figure 9.5). Using a technique we will describe 
in detail in Section 9.6, the address translation hardware uses the virtual address 
as an index to locate PTE 2 and read it from memory. Since the valid bit is set, the 
address translation hardware knows that VP 2 is cached in memory. So it uses the 
physical memory address in the PTE (which points to the start of the cached page 
in PP 1) to construct the physical address of the word. 


9.3.4 Page Faults 


In virtual memory parlance, a DRAM cache miss is known as a page fault. Fig- 
ure 9.6 shows the state of our example page table before the fault. The CPU has 
referenced a word in VP 3, which is not cached in DRAM. The address transla- 
tion hardware reads PTE 3 from memory, infers from the valid bit that VP 3 is 
not cached, and triggers a page fault exception. The page fault exception invokes 
a page fault exception handler in the kernel, which selects a victim page—in this 
case, VP 4 stored in PP 3. If VP 4 has been modified, then the kernel copies it back 
to disk. In either case, the kernel modifies the page table entry for VP 4 to reflect 
the fact that VP 4 is no longer cached in main memory. 


Figure 9.5 VM page hit. The reference to a word in VP 2 is a hit.


Figure 9.6 VM page fault (before). The reference to a word in VP 3 is a miss and
triggers a page fault.


Figure 9.7 VM page fault (after). The page fault handler selects VP 4 as the victim
and replaces it with a copy of VP 3 from disk. After the page fault handler restarts
the faulting instruction, it will read the word from memory normally, without
generating an exception.


Next, the kernel copies VP 3 from disk to PP 3 in memory, updates PTE 3,
and then returns. When the handler returns, it restarts the faulting instruction,
which resends the faulting virtual address to the address translation hardware.
But now, VP 3 is cached in main memory, and the page hit is handled normally by
the address translation hardware. Figure 9.7 shows the state of our example page
table after the page fault.
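
To make the hit and fault paths concrete, here is a toy simulation of demand paging for a system with eight virtual pages and four physical pages. It is a sketch, not a model of any real kernel: the names (vm_t, vm_access) are hypothetical, disk transfers are reduced to counters, and the round-robin victim choice is an assumption made for brevity, since real replacement policies are more sophisticated.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_VP 8   /* virtual pages, as in the running example */
#define NUM_PP 4   /* physical pages */

typedef struct {
    bool valid;    /* cached in DRAM? */
    bool dirty;    /* modified since it was paged in? */
    int  ppn;      /* physical page number, meaningful only if valid */
} pte_t;

typedef struct {
    pte_t pt[NUM_VP];
    int   owner[NUM_PP];   /* VP held by each PP, or -1 if free */
    int   next_victim;     /* round-robin pointer (assumed policy) */
    int   faults;          /* number of page faults so far */
    int   writebacks;      /* number of dirty victims copied back to disk */
} vm_t;

void vm_init(vm_t *vm)
{
    for (int v = 0; v < NUM_VP; v++)
        vm->pt[v] = (pte_t){ false, false, -1 };
    for (int p = 0; p < NUM_PP; p++)
        vm->owner[p] = -1;
    vm->next_victim = 0;
    vm->faults = 0;
    vm->writebacks = 0;
}

/* Reference virtual page vpn; returns the physical page that holds it. */
int vm_access(vm_t *vm, int vpn, bool is_write)
{
    assert(vpn >= 0 && vpn < NUM_VP);
    pte_t *pte = &vm->pt[vpn];

    if (!pte->valid) {                       /* page fault */
        vm->faults++;
        int ppn = vm->next_victim;           /* select a victim frame */
        vm->next_victim = (ppn + 1) % NUM_PP;

        int victim = vm->owner[ppn];
        if (victim >= 0) {
            if (vm->pt[victim].dirty)
                vm->writebacks++;            /* stands in for copying to disk */
            vm->pt[victim] = (pte_t){ false, false, -1 };
        }
        vm->owner[ppn] = vpn;                /* stands in for copying from disk */
        *pte = (pte_t){ true, false, ppn };
    }
    if (is_write)
        pte->dirty = true;
    return pte->ppn;                         /* hit path: PTE gives the frame */
}
```

A caller would run vm_init once and then call vm_access for each reference. The faults and writebacks counters show how often the slow path was taken and how many evictions required a copy back to disk.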

Virtual memory was invented in the early 1960s, long before the widening
CPU-memory gap spawned SRAM caches. As a result, virtual memory systems
use a different terminology from SRAM caches, even though many of the ideas
are similar. In virtual memory parlance, blocks are known as pages. The activity
of transferring a page between disk and memory is known as swapping or paging.
Pages are swapped in (paged in) from disk to DRAM, and swapped out (paged
out) from DRAM to disk. The strategy of waiting until the last moment to swap
in a page, when a miss occurs, is known as demand paging. Other approaches, such
as trying to predict misses and swap pages in before they are actually referenced,
are possible. However, all modern systems use demand paging.

Figure 9.8


9.3.5 Allocating Pages 


Figure 9.8 shows the effect on our example page table when the operating system
allocates a new page of virtual memory, for example, as a result of calling malloc.
In the example, VP 5 is allocated by creating room on disk and updating PTE 5 
to point to the newly created page on disk. 


9.3.6 Locality to the Rescue Again 


When many of us learn about the idea of virtual memory, our first impression is 
often that it must be terribly inefficient. Given the large miss penalties, we worry 
that paging will destroy program performance. In practice, virtual memory works 
well, mainly because of our old friend locality. 

Although the total number of distinct pages that programs reference during an 
entire run might exceed the total size of physical memory, the principle of locality 
promises that at any point in time they will tend to work on a smaller set of active 
pages known as the working set or resident set. After an initial overhead where
the working set is paged into memory, subsequent references to the working set
result in hits, with no additional disk traffic.

As long as our programs have good temporal locality, virtual memory systems
work quite well. But of course, not all programs exhibit good temporal locality. If
the working set size exceeds the size of physical memory, then the program can
produce an unfortunate situation known as thrashing, where pages are swapped in 
and out continuously. Although virtual memory is usually efficient, if a program's 
performance slows to a crawl, the wise programmer will consider the possibility 
that it is thrashing. 
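
The difference between a working set that fits and one that does not can be seen in a small fault-counting sketch. The function below is illustrative only: it assumes LRU replacement (real kernels use approximations of it), and count_faults_lru and MAX_FRAMES are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 64

/* Count page faults for a page reference string under LRU replacement. */
size_t count_faults_lru(const int *refs, size_t nrefs, size_t nframes)
{
    int    page[MAX_FRAMES] = {0};       /* page held by each frame */
    size_t last_used[MAX_FRAMES] = {0};  /* time of most recent reference */
    size_t used = 0, faults = 0;

    assert(nframes > 0 && nframes <= MAX_FRAMES);
    for (size_t t = 0; t < nrefs; t++) {
        size_t f, lru = 0;
        for (f = 0; f < used; f++) {
            if (page[f] == refs[t])
                break;                   /* hit */
            if (last_used[f] < last_used[lru])
                lru = f;                 /* track least recently used frame */
        }
        if (f == used) {                 /* miss: scanned every frame */
            faults++;
            if (used < nframes)
                f = used++;              /* fill a free frame */
            else
                f = lru;                 /* evict the LRU page */
            page[f] = refs[t];
        }
        last_used[f] = t;
    }
    return faults;
}
```

If a program cycles repeatedly through five pages, five frames give only the five initial faults. With four frames under LRU, every reference faults, which is exactly the thrashing behavior described above.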
9.4 VM as a Tool for Memory Management


In the last section, we saw how virtual memory provides a mechanism for using the 
DRAM to cache pages from a typically larger virtual address space. Interestingly, 
some early systems such as the DEC PDP-11/70 supported a virtual address space 
that was smaller than the available physical memory. Yet virtual memory was 
still a useful mechanism because it greatly simplified memory management and 
provided a natural way to protect memory.

Thus far, we have assumed a single page table that maps a single virtual
address space to the physical address space. In fact, operating systems provide
a separate page table, and thus a separate virtual address space, for each process.
Figure 9.9 shows the basic idea. In the example, the page table for process i maps
VP 1 to PP 2 and VP 2 to PP 7. Similarly, the page table for process j maps VP 1
to PP 7 and VP 2 to PP 10. Notice that multiple virtual pages can be mapped to
the same shared physical page.

The combination of demand paging and separate virtual address spaces has
a profound impact on the way that memory is used and managed in a system. In
particular, VM simplifies linking and loading, the sharing of code and data, and
allocating memory to applications. 


* Simplifying linking. A separate address space allows each process to use the
same basic format for its memory image, regardless of where the code and data
actually reside in physical memory. For example, as we saw in Figure 8.13, every
process on a given Linux system has a similar memory format. For 64-bit
address spaces, the code segment always starts at virtual address 0x400000.
The data segment follows the code segment after a suitable alignment gap.
The stack occupies the highest portion of the user process address space and
grows downward. Such uniformity greatly simplifies the design and implementation
of linkers, allowing them to produce fully linked executables that are
independent of the ultimate location of the code and data in physical memory.


* Simplifying loading. Virtual memory also makes it easy to load executable
and shared object files into memory. To load the .text and .data sections of
an object file into a newly created process, the Linux loader allocates virtual 
pages for the code and data segments, marks them as invalid (i.e., not cached), 
and points their page table entries to the appropriate locations in the object 
file. The interesting point is that the loader never actually copies any data 
from disk into memory. The data are paged in automatically and on demand 
by the virtual memory system the first time each page is referenced, either by 
the CPU when it fetches an instruction or by an executing instruction when it 
references a memory location. 

This notion of mapping a set of contiguous virtual pages to an arbitrary 
location in an arbitrary file is known as memory mapping. Linux provides 
a system call called mmap that allows application programs to do their own 
memory mapping. We will describe application-level memory mapping in
more detail in Section 9.8. 
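
As a brief preview, the sketch below maps an entire file read-only using the POSIX open, fstat, and mmap calls. The helper name map_file_readonly is hypothetical, and error handling is reduced to returning NULL.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only. Returns NULL on failure or if the file is
   empty; on success stores the file length in *len. */
const char *map_file_readonly(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    /* The call sets up the mapping; the file's pages are typically brought
       into memory on demand, when they are first referenced. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping remains valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```

A caller reads the file contents through the returned pointer as if they were an ordinary array and releases the mapping with munmap when finished.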


* Simplifying sharing. Separate address spaces provide the operating system
with a consistent mechanism for managing sharing between user processes 
and the operating system itself. In general, each process has its own private 
code, data, heap, and stack areas that are not shared with any other process. In 
this case, the operating system creates page tables that map the corresponding 
virtual pages to disjoint physical pages. 

However, in some instances it is desirable for processes to share code 
and data. For example, every process must call the same operating system 
kernel code, and every C program makes calls to routines in the standard C 
library such as printf. Rather than including separate copies of the kernel 
and standard C library in each process, the operating system can arrange 
for multiple processes to share a single copy of this code by mapping the 
appropriate virtual pages in different processes to the same physical pages, 
as we saw in Figure 9.9. 
* Simplifying memory allocation. Virtual memory provides a simple mechanism
for allocating additional memory to user processes. When a program running
in a user process requests additional heap space (e.g., as a result of calling 
malloc), the operating system allocates an appropriate number, say, k, of 
contiguous virtual memory pages, and maps them to k arbitrary physical pages 
located anywhere in physical memory. Because of the way page tables work, 
there is no need for the operating system to locate k contiguous pages of 
physical memory. The pages can be scattered randomly in physical memory. 




9.5 VM as a Tool for Memory Protection 


Any modern computer system must provide the means for the operating system 
to control access to the memory system. A user process should not be allowed
to modify its read-only code section. Nor should it be allowed to read or modify
any of the code and data structures in the kernel. It should not be allowed to read
or write the private memory of other processes, and it should not be allowed to
modify any virtual pages that are shared with other processes, unless all parties
explicitly allow it (via calls to explicit interprocess communication system calls).

As we have seen, providing separate virtual address spaces makes it easy to
isolate the private memories of different processes. But the address translation
mechanism can be extended in a natural way to provide even finer access control.
Since the address translation hardware reads a PTE each time the CPU generates
an address, it is straightforward to control access to the contents of a virtual page
by adding some additional permission bits to the PTE. Figure 9.10 shows the 
general idea. 

In this example, we have added three permission bits to each PTE. The SUP bit
indicates whether processes must be running in kernel (supervisor) mode to access
the page. Processes running in kernel mode can access any page, but processes
running in user mode are only allowed to access pages for which SUP is 0. The
READ and WRITE bits control read and write access to the page. For example,
if process i is running in user mode, then it has permission to read VP 0 and to
read or write VP 1. However, it is not allowed to access VP 2.

If an instruction violates these permissions, then the CPU triggers a general
protection fault that transfers control to an exception handler in the kernel, which
sends a SIGSEGV signal to the offending process. Linux shells typically report this
exception as a "segmentation fault."
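
The permission check can be sketched as a small predicate. The perm_t type and access_allowed function are hypothetical names, and applying the READ and WRITE bits in kernel mode as well as user mode is our simplifying assumption. Real hardware encodes these bits in the PTE itself and differs in the details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical permission bits carried in a PTE. */
typedef struct {
    bool sup;    /* page may be accessed only in kernel (supervisor) mode */
    bool read;   /* reads permitted */
    bool write;  /* writes permitted */
} perm_t;

typedef enum { ACCESS_READ, ACCESS_WRITE } access_t;

/* Returns true if the access is permitted; false corresponds to the case
   where the hardware would raise a protection fault. */
bool access_allowed(const perm_t *perm, bool kernel_mode, access_t kind)
{
    assert(perm != NULL);
    if (perm->sup && !kernel_mode)
        return false;                    /* user code touching a SUP page */
    return (kind == ACCESS_READ) ? perm->read : perm->write;
}
```

With bits chosen to match the example in the text (VP 0 readable only, VP 1 readable and writable, VP 2 marked SUP), a user-mode process passes the check when reading VP 0 or writing VP 1 and fails it for any access to VP 2.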



9.6 Address Translation


This section covers the basics of address translation. Our aim is to give you an
appreciation of the hardware's role in supporting virtual memory, with enough
detail so that you can work through some concrete examples by hand. However,
keep in mind that we are omitting a number of details, especially related to timing,
that are important to hardware designers but are beyond our scope. For your
reference, Figure 9.11 summarizes the symbols that we will be using throughout
this section.

Symbol      Description

Basic parameters
N = 2^n     Number of addresses in virtual address space
M = 2^m     Number of addresses in physical address space
P = 2^p     Page size (bytes)

Components of a virtual address (VA)
VPO         Virtual page offset (bytes)
VPN         Virtual page number
TLBI        TLB index
TLBT        TLB tag

Components of a physical address (PA)
PPO         Physical page offset (bytes)
PPN         Physical page number
CO          Byte offset within cache block
CI          Cache index
CT          Cache tag

Figure 9.11 Summary of address translation symbols.

Formally, address translation is a mapping between the elements of an
N-element virtual address space (VAS) and an M-element physical address space
(PAS),


MAP: VAS → PAS ∪ ∅


where


MAP(A) = A'   if data at virtual addr. A are present at physical addr. A' in PAS
         ∅    if data at virtual addr. A are not present in physical memory


Figure 9.12 shows how the MMU uses the page table to perform this mapping. 
A control register in the CPU, the page table base register (PTBR), points to the
current page table. The n-bit virtual address has two components: a p-bit virtual
page offset (VPO) and an (n - p)-bit virtual page number (VPN). The MMU uses
the VPN to select the appropriate PTE. For example, VPN 0 selects PTE 0, VPN 1
selects PTE 1, and so on. The corresponding physical address is the concatenation 
of the physical page number (PPN) from the page table entry and the VPO from 
the virtual address. Notice that since the physical and virtual pages are both P 
bytes, the physical page offset (PPO) is identical to the VPO. 
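
The split-and-concatenate step can be written as a few shifts and masks. The sketch below assumes 4 KB pages (p = 12) and a toy 16-entry page table. The pte_t layout and the translate function are hypothetical, and a false return stands in for a page fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define P_BITS   12u   /* p: assumed 4 KB pages */
#define NUM_PTES 16u   /* toy page table size (assumption) */

typedef struct {
    bool     valid;
    uint32_t ppn;      /* physical page number */
} pte_t;

/* Returns true and stores the physical address on a page hit;
   returns false when the PTE is not valid (page fault). */
bool translate(const pte_t *page_table, uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> P_BITS;                 /* high-order n - p bits */
    uint32_t vpo = va & ((1u << P_BITS) - 1u);   /* low-order p bits */

    assert(vpn < NUM_PTES);
    if (!page_table[vpn].valid)
        return false;

    /* PA = PPN concatenated with PPO, and PPO is identical to VPO. */
    *pa = (page_table[vpn].ppn << P_BITS) | vpo;
    return true;
}
```

For instance, if PTE 2 is valid and holds PPN 7, virtual address 0x2ABC has VPN 2 and VPO 0xABC, so it translates to physical address 0x7ABC.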
Figure 9.12 Address translation with a page table. The page table base register
(PTBR) locates the page table, and the VPN acts as an index into it. If valid = 0,
then the page is not in memory (page fault). The PPN from the selected PTE and
the page offset together form the physical address.


Figure 9.13(a) shows the steps that the CPU hardware performs when there
is a page hit.


Step 1. The processor generates a virtual address and sends it to the MMU.


Step 2. The MMU generates the PTE address and requests it from the cache/
main memory.


Step 3. The cache/main memory returns the PTE to the MMU.


Step 4. The MMU constructs the physical address and sends it to the cache/main 
memory. 


Step 5. The cache/main memory returns the requested data word to the pro- 
cessor. 


Unlike a page hit, which is handled entirely by hardware, handling a page 
fault requires cooperation between hardware and the operating system kernel 
(Figure 9.13(b)). 

Steps 1 to 3. The same as steps 1 to 3 in Figure 9.13(a).


Step 4. The valid bit in the PTE is zero, so the MMU triggers an exception,
which transfers control in the CPU to a page fault exception handler in
the operating system kernel.


Step 5. The fault handler identifies a victim page in physical memory, and if that
page has been modified, pages it out to disk.


Step 6. The fault handler pages in the new page and updates the PTE in memory.


Step 7. The fault handler returns to the original process, causing the faulting
instruction to be restarted. The CPU resends the offending virtual address
to the MMU. Because the virtual page is now cached in physical memory,
there is a hit, and after the MMU performs the steps in Figure 9.13(a), the
main memory returns the requested word to the processor.


Figure 9.13 Operational view of page hits and page faults. (a) Page hit. (b) Page
fault. VA: virtual address. PTEA: page table entry address. PTE: page table entry.
PA: physical address.


Practice Problem 9.3 (solution page 881)


Given a 32-bit virtual address space and a 24-bit physical address, determine the 
number of bits in the VPN, VPO, PPN, and PPO for the following page sizes P: 


Number of 
VPN bits VPO bits PPN bits PPO bits 
Figure 9.14 Integrating VM with a physically addressed cache. VA: virtual address.
PTEA: page table entry address. PTE: page table entry. PA: physical address. 


9.6.1 Integrating Caches and VM 


In any system that uses both virtual memory and SRAM caches, there is the 
issue of whether to use virtual or physical addresses to access the SRAM cache. 
Although a detailed discussion of the trade-offs is beyond our scope here, most
systems opt for physical addressing. With physical addressing, it is straightforward 
for multiple processes to have blocks in the cache at the same time and to share 
blocks from the same virtual pages. Further, the cache does not have to deal 
with protection issues, because access rights are checked as part of the address 
translation process. 

Figure 9.14 shows how a physically addressed cache might be integrated with 
virtual memory. The main idea is that the address translation occurs before the 
cache lookup. Notice that page table entries can be cached, just like any other
data words. 


9.6.2 Speeding Up Address Translation with a TLB 


As we have seen, every time the CPU generates a virtual address, the MMU must 
refer to a PTE in order to translate the virtual address into a physical address. In 
the worst case, this requires an additional fetch from memory, at a cost of tens to 
hundreds of cycles. If the PTE happens to be cached in L1, then the cost goes down
to a handful of cycles. However, many systems try to eliminate even this cost by
including a small cache of PTEs in the MMU called a translation lookaside buffer
(TLB).

A TLB is a small, virtually addressed cache where each line holds a block
consisting of a single PTE. A TLB usually has a high degree of associativity. As
shown in Figure 9.15, the index and tag fields that are used for set selection and line
matching are extracted from the virtual page number in the virtual address. If the
TLB has T = 2^t sets, then the TLB index (TLBI) consists of the t least significant
bits of the VPN, and the TLB tag (TLBT) consists of the remaining bits in the VPN.
Figure 9.15 Components of a virtual address that are used to access the TLB.
The TLB tag (TLBT) occupies bits n-1 through p+t, the TLB index (TLBI) occupies
bits p+t-1 through p, and the VPO occupies bits p-1 through 0.


Figure 9.16 Operational view of a TLB hit and miss. (a) TLB hit. (b) TLB miss.


Figure 9.16(a) shows the steps involved when there is a TLB hit (the usual
case). The key point here is that all of the address translation steps are performed
inside the on-chip MMU and thus are fast.


Step 1. The CPU generates a virtual address.

Steps 2 and 3. The MMU fetches the appropriate PTE from the TLB.

Step 4. The MMU translates the virtual address to a physical address and sends
it to the cache/main memory.


Step 5. The cache/main memory returns the requested data word to the CPU.


When there is a TLB miss, then the MMU must fetch the PTE from the L1
cache, as shown in Figure 9.16(b). The newly fetched PTE is stored in the TLB,
possibly overwriting an existing entry.
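
The hit and miss behavior of a TLB can be sketched as a small set-associative lookup structure. Everything here is an illustrative assumption: the geometry (4 sets of 4 lines), the names tlb_t, tlb_lookup, and tlb_fill, and the round-robin choice of which line to overwrite on a fill.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define T_BITS   2u                  /* t: TLB has T = 2^t = 4 sets (assumed) */
#define TLB_SETS (1u << T_BITS)
#define TLB_WAYS 4u                  /* lines per set (assumed) */

typedef struct {
    bool     valid;
    uint32_t tag;                    /* TLBT: the VPN bits above the index */
    uint32_t ppn;                    /* the cached translation */
} tlb_line_t;

typedef struct {
    tlb_line_t line[TLB_SETS][TLB_WAYS];
    unsigned   next[TLB_SETS];       /* round-robin victim per set (assumed) */
} tlb_t;

void tlb_init(tlb_t *tlb)
{
    memset(tlb, 0, sizeof *tlb);     /* all lines start out invalid */
}

/* TLB hit: returns true and stores the PPN. TLB miss: returns false. */
bool tlb_lookup(const tlb_t *tlb, uint32_t vpn, uint32_t *ppn)
{
    assert(tlb != NULL && ppn != NULL);
    uint32_t tlbi = vpn & (TLB_SETS - 1u);   /* t low-order bits of the VPN */
    uint32_t tlbt = vpn >> T_BITS;           /* remaining VPN bits */

    for (unsigned w = 0; w < TLB_WAYS; w++) {
        const tlb_line_t *l = &tlb->line[tlbi][w];
        if (l->valid && l->tag == tlbt) {
            *ppn = l->ppn;
            return true;
        }
    }
    return false;
}

/* After a miss, store the PTE fetched from memory, possibly overwriting
   an existing entry in the selected set. */
void tlb_fill(tlb_t *tlb, uint32_t vpn, uint32_t ppn)
{
    assert(tlb != NULL);
    uint32_t tlbi = vpn & (TLB_SETS - 1u);
    unsigned w = tlb->next[tlbi];
    tlb->next[tlbi] = (w + 1u) % TLB_WAYS;
    tlb->line[tlbi][w] = (tlb_line_t){ true, vpn >> T_BITS, ppn };
}
```

On a miss, an MMU model would obtain the PPN from the page table, call tlb_fill, and continue, so that later references to the same virtual page hit in the TLB.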


9.6.3 Multi-Level Page Tables


Thus far, we have assumed that the system uses a single page table to do address 
translation. But if we had a 32-bit address space, 4 KB pages, and a 4-byte PTE, 
then we would need a 4 MB page table resident in memory at all times, even if 
the application referenced only a small chunk of the virtual address space. The 
problem is compounded for systems with 64-bit address spaces. 
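
The 4 MB figure follows from one PTE per virtual page: a single-level table has 2^(n-p) entries, so its size is 2^(n-p) times the PTE size. The small helper below, whose name is hypothetical, carries out that arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a single-level page table covering an n-bit virtual
   address space with 2^p-byte pages and pte_size-byte entries. */
uint64_t page_table_bytes(unsigned n, unsigned p, uint64_t pte_size)
{
    assert(p <= n && n - p < 64);
    uint64_t num_ptes = UINT64_C(1) << (n - p);  /* one PTE per virtual page */
    return num_ptes * pte_size;
}
```

With n = 32, p = 12, and 4-byte PTEs this gives 2^20 entries and 4 MB. As a hypothetical comparison, a 48-bit address space with the same page size and 8-byte PTEs would need 2^36 entries, or 512 GB, which shows why a flat table is impractical for large address spaces.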

The common approach for compacting the page table is to use a hierarchy 
of page tables instead. The idea is easiest to understand with a concrete example. 
Consider a 32-bit virtual address space partitioned into 4 KB pages, with page 
table entries that are 4 bytes each. Suppose also that at this point in time the virtual 
address space has the following form: The first 2 K pages of memory are allocated 
for code and data, the next 6 K pages are unallocated, the next 1,023 pages are also 
unallocated, and the next page is allocated for the user stack. Figure 9.17 shows 
how we might construct a two-level page table hierarchy for this virtual address 
space. 

Each PTE in the level 1 table is responsible for mapping a 4 MB chunk of the 
virtual address space, where each chunk consists of 1,024 contiguous pages. For 
example, PTE 0 maps the first chunk, PTE 1 the next chunk, and so on. Given that 
the address space is 4 GB, 1,024 PTEs are sufficient to cover the entire space. 

If every page in chunk i is unallocated, then level 1 PTE i is null. For example,
in Figure 9.17, chunks 2-7 are unallocated. However, if at least one page in chunk 
i is allocated, then level 1 PTE i points to the base of a level 2 page table. For 
example, in Figure 9.17, all or portions of chunks 0, 1, and 8 are allocated, so their 
level 1 PTEs point to level 2 page tables. 

Each PTE in a level 2 page table is responsible for mapping a 4 KB page of
virtual memory, just as before when we looked at single-level page tables. Notice
that with 4-byte PTEs, each level 1 and level 2 page table is 4 kilobytes, which 
conveniently is the same size as a page. 

This scheme reduces memory requirements in two ways. First, if a PTE in the 
level 1 table is null, then the corresponding level 2 page table does not even have 
to exist. This represents a significant potential savings, since most of the 4 GB 
virtual address space for a typical program is unallocated. Second, only the level
1 table needs to be in main memory at all times. The level 2 page tables can be
created and paged in and out by the VM system as they are needed, which reduces
pressure on main memory. Only the most heavily used level 2 page tables need to
be cached in main memory.
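
A two-level walk for this 32-bit example can be sketched as follows. The virtual address splits into a 10-bit level 1 index, a 10-bit level 2 index, and a 12-bit offset. The type and function names are hypothetical, and for simplicity the level 1 table holds C pointers rather than 4-byte PTEs, with NULL standing in for a null level 1 PTE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 1024u                    /* PTEs per table: 4 KB / 4 bytes */

typedef struct { bool valid; uint32_t ppn; } l2_pte_t;
typedef struct { l2_pte_t pte[ENTRIES]; } l2_table_t;
typedef struct { l2_table_t *l2[ENTRIES]; } l1_table_t;  /* NULL: chunk unallocated */

/* Returns true and stores the physical address if the page is mapped. */
bool walk(const l1_table_t *l1, uint32_t va, uint32_t *pa)
{
    uint32_t vpn1 = va >> 22;               /* selects a 4 MB chunk */
    uint32_t vpn2 = (va >> 12) & 0x3FFu;    /* selects a page within the chunk */
    uint32_t vpo  = va & 0xFFFu;            /* offset within the 4 KB page */

    assert(l1 != NULL && pa != NULL);
    const l2_table_t *l2 = l1->l2[vpn1];
    if (l2 == NULL)                         /* no level 2 table exists */
        return false;
    if (!l2->pte[vpn2].valid)               /* page not cached in memory */
        return false;

    *pa = (l2->pte[vpn2].ppn << 12) | vpo;
    return true;
}
```

In the example address space, only level 1 entries 0, 1, and 8 would be non-NULL, so only three level 2 tables need to exist at all. A lookup in chunks 2 through 7 stops at the level 1 table.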
Figure 9.17 A two-level page table hierarchy. Notice that addresses increase from
top to bottom.

Figure 9.18 Address translation with a k-level page table.


Figure 9.18 summarizes address translation with a k-level page table hierarchy.
The virtual address is partitioned into k VPNs and a VPO. Each VPN i, 1 ≤ i ≤ k,
is an index into a page table at level i. Each PTE in a level j table, 1 ≤ j ≤ k - 1,
points to the base of some page table at level j + 1. Each PTE in a level k table
contains either the PPN of some physical page or the address of a disk block.
To construct the physical address, the MMU must access k PTEs before it can
determine the PPN. As with a single-level hierarchy, the PPO is identical to the
VPO.

Accessing k PTEs may seem expensive and impractical at first glance. How- 
ever, the TLB comes to the rescue here by caching PTEs from the page tables at 
the different levels. In practice, address translation with multi-level page tables is 
not significantly slower than with single-level page tables. 



9.6.4 Putting It Together: End-to-End Address Translation 


In this section, we put it all together with a concrete example of end-to-end 
address translation on a small system with a TLB and L1 d-cache. To keep things 
manageable, we make the following assumptions: 


* The memory is byte addressable.

* Memory accesses are to 1-byte words (not 4-byte words).

* Virtual addresses are 14 bits wide (n = 14).

* Physical addresses are 12 bits wide (m = 12).

* The page size is 64 bytes (P = 64).

* The TLB is 4-way set associative with 16 total entries.

* The L1 d-cache is physically addressed and direct mapped, with a 4-byte line
size and 16 total sets.


Figure 9.19 shows the formats of the virtual and physical addresses. Since each 
page is 2^6 = 64 bytes, the low-order 6 bits of the virtual and physical addresses serve
as the VPO and PPO, respectively. The high-order 8 bits of the virtual address
serve as the VPN. The high-order 6 bits of the physical address serve as the PPN. 

Figure 9.20 shows a snapshot of our little memory system, including the TLB
(Figure 9.20(a)), a portion of the page table (Figure 9.20(b)), and the L1 cache
(Figure 9.20(c)). Above the figures of the TLB and cache, we have also shown
how the bits of the virtual and physical addresses are partitioned by the hardware
as it accesses these devices.

Figure 9.19 Addressing for small memory system. Assume 14-bit virtual addresses
(n = 14), 12-bit physical addresses (m = 12), and 64-byte pages (P = 64).

Figure 9.20 TLB, page table, and cache for small memory system. All values in the
TLB, page table, and cache are in hexadecimal notation.

TLB. The TLB is virtually addressed using the bits of the VPN. Since the TLB
has four sets, the 2 low-order bits of the VPN serve as the set index (TLBI).
The remaining 6 high-order bits serve as the tag (TLBT) that distinguishes
the different VPNs that might map to the same TLB set.


Page table. The page table is a single-level design with a total of 2^8 = 256 page
table entries (PTEs). However, we are only interested in the first 16 of
these. For convenience, we have labeled each PTE with the VPN that
indexes it; but keep in mind that these VPNs are not part of the page
table and not stored in memory. Also, notice that the PPN of each invalid
PTE is denoted with a dash to reinforce the idea that whatever bit values
might happen to be stored there are not meaningful.


Cache. The direct-mapped cache is addressed by the fields in the physical
address. Since each block is 4 bytes, the low-order 2 bits of the physical
address serve as the block offset (CO). Since there are 16 sets, the next 4
bits serve as the set index (CI). The remaining 6 bits serve as the tag (CT).


Given this initial setup, let's see what happens when the CPU executes a load
instruction that reads the byte at address 0x03d4. (Recall that our hypothetical
CPU reads 1-byte words rather than 4-byte words.) To begin this kind of manual
simulation, we find it helpful to write down the bits in the virtual address, identify
the various fields we will need, and determine their hex values. The hardware
performs a similar task when it decodes the address.

To begin, the MMU extracts the VPN (0x0F) from the virtual address and
checks with the TLB to see if it has cached a copy of PTE 0x0F from some previous
memory reference. The TLB extracts the TLB index (0x03) and the TLB tag (0x3)
from the VPN, hits on a valid match in the second entry of set 0x3, and returns
the cached PPN (0x0D) to the MMU.

If the TLB had missed, then the MMU would need to fetch the PTE from main
memory. However, in this case, we got lucky and had a TLB hit. The MMU now
has everything it needs to form the physical address. It does this by concatenating
the PPN (0x0D) from the PTE with the VPO (0x14) from the virtual address, which
forms the physical address (0x354).

Next, the MMU sends the physical address to the cache, which extracts the
cache offset CO (0x0), the cache set index CI (0x5), and the cache tag CT (0x0D)
from the physical address.

Since the tag in set 0x5 matches CT, the cache detects a hit, reads out the data
byte (0x36) at offset CO, and returns it to the MMU, which then passes it back to
the CPU.
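
The field extraction in this walk-through is just shifting and masking, and can be sketched in C. This is a minimal illustration that assumes the parameters of the small example system (6-bit VPO, 2-bit TLBI, 2-bit CO, 4-bit CI); the type and function names (va_fields, pa_fields, split_va, make_pa, split_pa) are hypothetical and chosen only for this example. The PPN is passed in as an argument because it comes from the TLB or page table contents, not from arithmetic.

```c
/* Field widths for the small example system (taken from the assumptions
   listed at the start of this section). */
enum { VPO_BITS = 6, TLBI_BITS = 2, CO_BITS = 2, CI_BITS = 4 };

typedef struct { unsigned vpn, vpo, tlbi, tlbt; } va_fields;
typedef struct { unsigned co, ci, ct; } pa_fields;

/* Split a 14-bit virtual address into VPN/VPO and the TLB index and tag. */
va_fields split_va(unsigned va)
{
    va_fields f;
    f.vpo  = va & ((1u << VPO_BITS) - 1);      /* low 6 bits */
    f.vpn  = va >> VPO_BITS;                   /* high 8 bits */
    f.tlbi = f.vpn & ((1u << TLBI_BITS) - 1);  /* low 2 bits of the VPN */
    f.tlbt = f.vpn >> TLBI_BITS;               /* remaining 6 bits of the VPN */
    return f;
}

/* Form a physical address by concatenating a PPN with the page offset
   (the PPO is identical to the VPO). */
unsigned make_pa(unsigned ppn, unsigned vpo)
{
    return (ppn << VPO_BITS) | vpo;
}

/* Split a 12-bit physical address into the cache offset, index, and tag. */
pa_fields split_pa(unsigned pa)
{
    pa_fields f;
    f.co = pa & ((1u << CO_BITS) - 1);
    f.ci = (pa >> CO_BITS) & ((1u << CI_BITS) - 1);
    f.ct = pa >> (CO_BITS + CI_BITS);
    return f;
}
```

For the address in the example, split_va(0x03d4) yields VPN 0x0F, VPO 0x14, TLBI 0x3, and TLBT 0x03; make_pa(0x0D, 0x14) yields 0x354; and split_pa(0x354) yields CO 0x0, CI 0x5, and CT 0x0D, matching the values worked out by hand.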

Other paths through the translation process are also possible. For example, if 
the TLB misses, then the MMU must fetch the PPN from a PTE in the page table. 
If the resulting PTE is invalid, then there is a page fault and the kernel must page 
in the appropriate page and rerun the load instruction. Another possibility is that 
the PTE is valid, but the necessary memory block misses in the cache.


Show how the example memory system in Section 9.6.4 translates a virtual address 
into a physical address and accesses the cache. For the given virtual address, 
indicate the TLB entry accessed, physical address, and cache byte value returned. 
Indicate whether the TLB misses, whether a page fault occurs, and whether a cache 
miss occurs. If there is a cache miss, enter "—" for "Cache byte returned." If there
is a page fault, enter "—" for "PPN" and leave parts C and D blank.


Virtual address: 0x03d7


A. Virtual address format

B. Address translation


Parameter            Value
VPN                  ______
TLB index            ______
TLB tag              ______
TLB hit? (Y/N)       ______
Page fault? (Y/N)    ______
PPN                  ______


C. Physical address format


9.7 Case Study: The Intel Core i7/Linux Memory System





We conclude our discussion of virtual memory mechanisms with a case study of
a real system: an Intel Core i7 running Linux. Although the underlying Haswell
microarchitecture allows for full 64-bit virtual and physical address spaces, the
current Core i7 implementations (and those for the foreseeable future) support a
48-bit (256 TB) virtual address space and a 52-bit (4 PB) physical address space,
along with a compatibility mode that supports 32-bit (4 GB) virtual and physical
address spaces.

Figure 9.21 gives the highlights of the Core i7 memory system. The processor
package (chip) includes four cores, a large L3 cache shared by all of the cores, and
a DDR3 memory controller. Each core contains a hierarchy of TLBs, a hierarchy
of data and instruction caches, and a set of fast point-to-point links, based on the
QuickPath technology, for communicating directly with the other cores and the
external I/O bridge. The TLBs are virtually addressed and 4-way set associative.
The L1, L2, and L3 caches are physically addressed, with a block size of 64 bytes.
L1 and L2 are 8-way set associative, and L3 is 16-way set associative. The page
size can be configured at start-up time as either 4 KB or 4 MB. Linux uses 4 KB
pages.

Figure 9.22 Summary of Core i7 address translation. For simplicity, the i-caches,
i-TLB, and L2 unified TLB are not shown.


9.7.1 Core i7 Address Translation


Figure 9.22 summarizes the entire Core i7 address translation process, from the
time the CPU generates a virtual address until a data word arrives from memory.
The Core i7 uses a four-level page table hierarchy. Each process has its own private
page table hierarchy. When a Linux process is running, the page tables associated
with allocated pages are all memory-resident, although the Core i7 architecture
allows these page tables to be swapped in and out. The CR3 control register
contains the physical address of the beginning of the level 1 (L1) page table. The
value of CR3 is part of each process context, and is restored during each context
switch.


Field       Description
P           Child page table present in physical memory (1) or not (0).
R/W         Read-only or read-write access permission for all reachable pages.
U/S         User or supervisor (kernel) mode access permission for all reachable pages.
WT          Write-through or write-back cache policy for the child page table.
CD          Caching disabled or enabled for the child page table.
A           Reference bit (set by MMU on reads and writes, cleared by software).
PS          Page size either 4 KB or 4 MB (defined for level 1 PTEs only).
Base addr   40 most significant bits of physical base address of child page table.
XD          Disable or enable instruction fetches from all pages reachable from this PTE.

P = 0: remaining bits available for OS (page table location on disk).


Figure 9.23 Format of level 1, level 2, and level 3 page table entries. Each entry
references a 4 KB child page table.


Figure 9.23 shows the format of an entry in a level 1, level 2, or level 3
page table. When P = 1 (which is always the case with Linux), the address field
contains a 40-bit physical page number (PPN) that points to the beginning of the
appropriate page table. Notice that this imposes a 4 KB alignment requirement
on page tables.

Figure 9.24 shows the format of an entry in a level 4 page table. When P = 1,
the address field contains a 40-bit PPN that points to the base of some page in
physical memory. Again, this imposes a 4 KB alignment requirement on physical
pages.

The PTE has three permission bits that control access to the page. The R/W bit
determines whether the contents of a page are read/write or read-only. The U/S
bit, which determines whether the page can be accessed in user mode, protects
code and data in the operating system kernel from user programs. The XD (execute
disable) bit, which was introduced in 64-bit systems, can be used to disable
instruction fetches from individual memory pages. This is an important new feature
that allows the operating system kernel to reduce the risk of buffer overflow
attacks by restricting execution to the read-only code segment.

As the MMU translates each virtual address, it also updates two other bits that
can be used by the kernel's page fault handler. The MMU sets the A bit, which
is known as a reference bit, each time a page is accessed. The kernel can use the
reference bit to implement its page replacement algorithm. The MMU sets the D
bit, or dirty bit, each time the page is written to. A page that has been modified is
sometimes called a dirty page. The dirty bit tells the kernel whether or not it must
write back a victim page before it copies in a replacement page. The kernel can
call a special kernel-mode instruction to clear the reference or dirty bits.


Field       Description
P           Child page present in physical memory (1) or not (0).
R/W         Read-only or read/write access permission for child page.
U/S         User or supervisor mode (kernel mode) access permission for child page.
WT          Write-through or write-back cache policy for the child page.
CD          Cache disabled or enabled.
A           Reference bit (set by MMU on reads and writes, cleared by software).
D           Dirty bit (set by MMU on writes, cleared by software).
G           Global page (don't evict from TLB on task switch).
Base addr   40 most significant bits of physical base address of child page.
XD          Disable or enable instruction fetches from the child page.

P = 0: remaining bits available for OS (page location on disk).


Figure 9.24 Format of level 4 page table entries. Each entry references a 4 KB child
page.
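
To make the PTE fields concrete, here is a minimal sketch of how software might test them with masks. The bit positions follow the x86-64 entry layout that Figures 9.23 and 9.24 depict (P in bit 0, R/W in bit 1, U/S in bit 2, WT in bit 3, CD in bit 4, A in bit 5, D in bit 6, G in bit 8, the base address in bits 12 through 51, and XD in bit 63). The macro and function names are hypothetical and are not taken from the Linux source.

```c
#include <stdint.h>

/* Flag bits of a level 4 page table entry (names are illustrative). */
#define PTE_P   (1ULL << 0)   /* page present in physical memory */
#define PTE_RW  (1ULL << 1)   /* writes allowed */
#define PTE_US  (1ULL << 2)   /* user-mode access allowed */
#define PTE_WT  (1ULL << 3)   /* write-through cache policy */
#define PTE_CD  (1ULL << 4)   /* caching disabled */
#define PTE_A   (1ULL << 5)   /* reference bit */
#define PTE_D   (1ULL << 6)   /* dirty bit */
#define PTE_G   (1ULL << 8)   /* global page */
#define PTE_XD  (1ULL << 63)  /* execute disable */

/* The 40-bit base address field occupies bits 12..51. */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL

/* Extract the 40-bit physical page number from a PTE. */
uint64_t pte_ppn(uint64_t pte)
{
    return (pte & PTE_ADDR_MASK) >> 12;
}

/* Nonzero if a user-mode store to the page is permitted. */
int pte_user_can_write(uint64_t pte)
{
    return (pte & PTE_P) && (pte & PTE_RW) && (pte & PTE_US);
}

/* Nonzero if the page is resident and has been modified, so that it
   would have to be written back before being replaced. */
int pte_needs_writeback(uint64_t pte)
{
    return (pte & PTE_P) && (pte & PTE_D);
}
```

Because the low 12 bits of the entry hold flags rather than address bits, shifting the masked base address right by 12 recovers the PPN, which is another way to see the 4 KB alignment requirement.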

Figure 9.25 shows how the Core i7 MMU uses the four levels of page tables
to translate a virtual address to a physical address. The 36-bit VPN is partitioned
into four 9-bit chunks, each of which is used as an offset into a page table. The
CR3 register contains the physical address of the L1 page table. VPN 1 provides
an offset to an L1 PTE, which contains the base address of the L2 page table. VPN
2 provides an offset to an L2 PTE, and so on.
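
The partitioning of the VPN can be sketched with shifts and masks. This is an illustration only, assuming a 48-bit virtual address with 4 KB pages as described above; the function names vpn_chunk and vpo are hypothetical.

```c
#include <stdint.h>

/* Return the 9-bit chunk VPN i (level = 1..4) of a 48-bit virtual address.
   VPN 1 occupies bits 47..39, VPN 2 bits 38..30, VPN 3 bits 29..21,
   and VPN 4 bits 20..12. */
unsigned vpn_chunk(uint64_t va, int level)
{
    int shift = 12 + 9 * (4 - level);
    return (unsigned)((va >> shift) & 0x1FF);
}

/* Return the 12-bit virtual page offset. */
unsigned vpo(uint64_t va)
{
    return (unsigned)(va & 0xFFF);
}
```

Each chunk selects one of the 2^9 = 512 entries in the page table at its level, and 512 entries fit in one 4 KB page if each entry is 8 bytes.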


9.7.2 Linux Virtual Memory System 


A virtual memory system requires close cooperation between the hardware and
the kernel. Details vary from version to version, and a complete description is
beyond our scope. Nonetheless, our aim in this section is to describe enough of
the Linux virtual memory system to give you a sense of how a real operating system
organizes virtual memory and how it handles page faults.

Linux maintains a separate virtual address space for each process of the form
shown in Figure 9.26. We have seen this picture a number of times already, with
its familiar code, data, heap, shared library, and stack segments. Now that we
understand address translation, we can fill in some more details about the kernel
virtual memory that lies above the user stack.

The kernel virtual memory contains the code and data structures in the kernel.
Some regions of the kernel virtual memory are mapped to physical pages that
are shared by all processes. For example, each process shares the kernel's code
and global data structures. Interestingly, Linux also maps a set of contiguous
virtual pages (equal in size to the total amount of DRAM in the system) to the
corresponding set of contiguous physical pages. This provides the kernel with a
convenient way to access any specific location in physical memory, for example,
when it needs to access page tables or to perform memory-mapped I/O operations
on devices that are mapped to particular physical memory locations.


Aside  Optimizing address translation

In our discussion of address translation, we have described a sequential two-step process where the
MMU (1) translates the virtual address to a physical address and then (2) passes the physical address
to the L1 cache. However, real hardware implementations use a neat trick that allows these steps to
be partially overlapped, thus speeding up accesses to the L1 cache. For example, a virtual address on
a Core i7 with 4 KB pages has 12 bits of VPO, and these bits are identical to the 12 bits of PPO in the
corresponding physical address. Since the 8-way set associative physically addressed L1 caches have
64 sets and 64-byte cache blocks, each physical address has 6 (log2 64) cache offset bits and 6 (log2 64)
index bits. These 12 bits fit exactly in the 12-bit VPO of a virtual address, which is no accident! When
the CPU needs a virtual address translated, it sends the VPN to the MMU and the VPO to the L1
cache. While the MMU is requesting a page table entry from the TLB, the L1 cache is busy using the
VPO bits to find the appropriate set and read out the eight tags and corresponding data words in that
set. When the MMU gets the PPN back from the TLB, the cache is ready to try to match the PPN to
one of these eight tags.

Other regions of kernel virtual memory contain data that differ for each 
process. Examples include page tables, the stack that the kernel uses when it is 
executing code in the context of the process, and various data structures that keep 
track of the current organization of the virtual address space. 


Linux Virtual Memory Areas 


Linux organizes the virtual memory as a collection of areas (also called segments). 
An area is a contiguous chunk of existing (allocated) virtual memory whose pages 
are related in some way. For example, the code segment, data segment, heap, 
shared library segment, and user stack are all distinct areas. Each existing virtual 
page is contained in some area, and any virtual page that is not part of some area 
does not exist and cannot be referenced by the process. The notion of an area is 
important because it allows the virtual address space to have gaps. The kernel does 
not keep track of virtual pages that do not exist, and such pages do not consume 
any additional resources in memory, on disk, or in the kernel itself. 

Figure 9.27 highlights the kernel data structures that keep track of the virtual
memory areas in a process. The kernel maintains a distinct task structure
(task_struct in the source code) for each process in the system. The elements of the task
structure either contain or point to all of the information that the kernel needs to
run the process (e.g., the PID, pointer to the user stack, name of the executable
object file, and program counter).

Figure 9.27 How Linux organizes virtual memory.

One of the entries in the task structure points to an mm_struct that characterizes
the current state of the virtual memory. The two fields of interest to us
are pgd, which points to the base of the level 1 table (the page global directory),
and mmap, which points to a list of vm_area_structs (area structs), each of which
characterizes an area of the current virtual address space. When the kernel runs
this process, it stores pgd in the CR3 control register.

For our purposes, the area struct for a particular area contains the following
fields:

vm_start. Points to the beginning of the area.

vm_end. Points to the end of the area.

vm_prot. Describes the read/write permissions for all of the pages contained
in the area.

vm_flags. Describes (among other things) whether the pages in the area are
shared with other processes or private to this process.

vm_next. Points to the next area struct in the list.

Figure 9.28 Linux page fault handling. The figure marks three cases in the process
virtual memory: a segmentation fault (accessing a nonexistent page), a normal page
fault, and a protection exception (e.g., violating permission by writing to a
read-only page).

Linux Page Fault Exception Handling 


Suppose the MMU triggers a page fault while trying to translate some virtual
address A. The exception results in a transfer of control to the kernel's page fault
handler, which then performs the following steps:

1. Is virtual address A legal? In other words, does A lie within an area defined by
some area struct? To answer this question, the fault handler searches the list of
area structs, comparing A with the vm_start and vm_end in each area struct.
If the instruction is not legal, then the fault handler triggers a segmentation
fault, which terminates the process. This situation is labeled "1" in Figure 9.28.
Because a process can create an arbitrary number of new virtual memory
areas (using the mmap function described in the next section), a sequential
search of the list of area structs might be very costly. So in practice, Linux
superimposes a tree on the list, using some fields that we have not shown, and
performs the search on this tree.

2. Is the attempted memory access legal? In other words, does the process have
permission to read, write, or execute the pages in this area? For example,
was the page fault the result of a store instruction trying to write to a read-only
page in the code segment? Is the page fault the result of a process
running in user mode that is attempting to read a word from kernel virtual
memory? If the attempted access is not legal, then the fault handler triggers a
protection exception, which terminates the process. This situation is labeled
"2" in Figure 9.28.

3. At this point, the kernel knows that the page fault resulted from a legal
operation on a legal virtual address. It handles the fault by selecting a victim
page, swapping out the victim page if it is dirty, swapping in the new page,
and updating the page table. When the page fault handler returns, the CPU
restarts the faulting instruction, which sends A to the MMU again. This time,
the MMU translates A normally, without generating a page fault.
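
The three-way decision made by the fault handler can be sketched as a walk over the area list. This is a simplified illustration, not kernel code: struct area is a hypothetical stand-in for the real vm_area_struct, the can_write field stands in for the write permission recorded in vm_prot, vm_end is assumed to be the first address past the end of the area, and only write permission is checked. As noted above, Linux actually searches a tree rather than scanning the list sequentially.

```c
#include <stddef.h>

/* Hypothetical, simplified area struct for illustration only. */
struct area {
    unsigned long vm_start;  /* first address in the area */
    unsigned long vm_end;    /* assumed: first address past the area */
    int can_write;           /* stand-in for the vm_prot write permission */
    struct area *vm_next;    /* next area struct in the list */
};

enum fault_kind {
    SEGMENTATION_FAULT,    /* step 1 failed: address is in no area */
    PROTECTION_EXCEPTION,  /* step 2 failed: access not permitted */
    NORMAL_PAGE_FAULT      /* step 3: legal access, page must be brought in */
};

/* Classify a fault on address addr; is_write is nonzero for a store. */
enum fault_kind classify_fault(const struct area *list,
                               unsigned long addr, int is_write)
{
    for (const struct area *a = list; a != NULL; a = a->vm_next) {
        if (addr >= a->vm_start && addr < a->vm_end) {
            if (is_write && !a->can_write)
                return PROTECTION_EXCEPTION;
            return NORMAL_PAGE_FAULT;
        }
    }
    return SEGMENTATION_FAULT;
}
```

An address that falls in a gap between areas is classified as a segmentation fault without consulting any page table, which is why nonexistent pages cost the kernel nothing to track.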



9.8 Memory Mapping


Linux initializes the contents of a virtual memory area by associating it with an
object on disk, a process known as memory mapping. Areas can be mapped to one
of two types of objects:


1. Regular file in the Linux file system: An area can be mapped to a contiguous
section of a regular disk file, such as an executable object file. The file section
is divided into page-size pieces, with each piece containing the initial contents
of a virtual page. Because of demand paging, none of these virtual pages is
actually swapped into physical memory until the CPU first touches the page
(i.e., issues a virtual address that falls within that page's region of the address
space). If the area is larger than the file section, then the area is padded with
zeros.

2. Anonymous file: An area can also be mapped to an anonymous file, created
by the kernel, that contains all binary zeros. The first time the CPU touches
a virtual page in such an area, the kernel finds an appropriate victim page
in physical memory, swaps out the victim page if it is dirty, overwrites the
victim page with binary zeros, and updates the page table to mark the page
as resident. Notice that no data are actually transferred between disk and
memory. For this reason, pages in areas that are mapped to anonymous files
are sometimes called demand-zero pages.


In either case, once a virtual page is initialized, it is swapped back and forth
between a special swap file maintained by the kernel. The swap file is also known
as the swap space or the swap area. An important point to realize is that at any
point in time, the swap space bounds the total amount of virtual pages that can be
allocated by the currently running processes.


9.8.1 Shared Objects Revisited 


The idea of memory mapping resulted from a clever insight that if the virtual 
memory system could be integrated into the conventional file system, then it could 
provide a simple and efficient way to load programs and data into memory. 

As we have seen, the process abstraction promises to provide each process
with its own private virtual address space that is protected from errant writes
or reads by other processes. However, many processes have identical read-only
code areas. For example, each process that runs the Linux shell program bash has
the same code area. Further, many programs need to access identical copies of
read-only run-time library code. For example, every C program requires functions
from the standard C library such as printf. It would be extremely wasteful for
each process to keep duplicate copies of these commonly used codes in physical
memory. Fortunately, memory mapping provides us with a clean mechanism for
controlling how objects are shared by multiple processes.

An object can be mapped into an area of virtual memory as either a shared
object or a private object. If a process maps a shared object into an area of its virtual
address space, then any writes that the process makes to that area are visible to
any other processes that have also mapped the shared object into their virtual
memory. Further, the changes are also reflected in the original object on disk.

Changes made to an area mapped to a private object, on the other hand, are
not visible to other processes, and any writes that the process makes to the area
are not reflected back to the object on disk. A virtual memory area into which a
shared object is mapped is often called a shared area. Similarly for a private area.

Suppose that process 1 maps a shared object into an area of its virtual memory,
as shown in Figure 9.29(a). Now suppose that process 2 maps the same shared
object into its address space (not necessarily at the same virtual address as process 1),
as shown in Figure 9.29(b).

Figure 9.29 A shared object. (a) After process 1 maps the shared object. (b) After
process 2 maps the same shared object. (Note that the physical pages are not
necessarily contiguous.)

Since each object has a unique filename, the kernel can quickly determine
that process 1 has already mapped this object and can point the page table entries
in process 2 to the appropriate physical pages. The key point is that only a single
copy of the shared object needs to be stored in physical memory, even though the
object is mapped into multiple shared areas. For convenience, we have shown the
physical pages as being contiguous, but of course this is not true in general.

Private objects are mapped into virtual memory using a clever technique
known as copy-on-write. A private object begins life in exactly the same way as a
shared object, with only one copy of the private object stored in physical memory.
For example, Figure 9.30(a) shows a case where two processes have mapped a
private object into different areas of their virtual memories but share the same
physical copy of the object. For each process that maps the private object, the page
table entries for the corresponding private area are flagged as read-only, and the
area struct is flagged as private copy-on-write. So long as neither process attempts
to write to its respective private area, they continue to share a single copy of the
object in physical memory. However, as soon as a process attempts to write to
some page in the private area, the write triggers a protection fault.

Figure 9.30 A private copy-on-write object. (a) After both processes have mapped
the private copy-on-write object. (b) After process 2 writes to a page in the
private area.

When the fault handler notices that the protection exception was caused by 
the process trying to write to a page in a private copy-on-write area, it creates a 
new copy of the page in physical memory, updates the page table entry to point 
to the new copy, and then restores write permissions to the page, as shown in
Figure 9.30(b). When the fault handler returns, the CPU re-executes the write, 
which now proceeds normally on the newly created page. 

By deferring the copying of the pages in private objects until the last possible 
moment, copy-on-write makes the most efficient use of scarce physical memory. 










9.8.2 The fork Function Revisited 





Now that we understand virtual memory and memory mapping, we can get a clear 
idea of how the fork function creates a new process with its own independent 
virtual address space. 

When the fork function is called by the current process, the kernel creates 
various data structures for the new process and assigns it a unique PID. To create 
the virtual memory for the new process, it creates exact copies of the current
process's mm_struct, area structs, and page tables. It flags each page in both
processes as read-only, and flags each area struct in both processes as private
copy-on-write.

When the fork returns in the new process, the new process now has an exact 
copy of the virtual memory as it existed when the fork was called. When either 
of the processes performs any subsequent writes, the copy-on-write mechanism 
creates new pages, thus preserving the abstraction of a private address space for 
each process. 
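
The effect of this private address space is easy to observe from user code. The following is a small sketch (the function name fork_private_copy_demo is hypothetical) in which the child writes to a variable that it inherited from the parent. The write lands in the child's own copy of the page, so the parent's value is unchanged. The program cannot observe the copy-on-write mechanism itself, only the isolation it preserves.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that modifies its copy of x, then return the parent's
   value of x after the child has exited. Returns -1 on error. */
int fork_private_copy_demo(void)
{
    int x = 1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write goes to the child's private copy of the page. */
        x = 2;
        _exit(x == 2 ? 0 : 1);
    }

    /* Parent: wait for the child, then check that x was not affected. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return x;
}
```

The parent sees x equal to 1 even though the child set its copy to 2, which is the behavior the private address space abstraction promises.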

















9.8.3 The execve Function Revisited 






Virtual memory and memory mapping also play key roles in the process of loading 
programs into memory. Now that we understand these concepts, we can under- 
stand how the execve function really loads and executes programs. Suppose that 
the program running in the current process makes the following call: 








execve("a.out", NULL, NULL); 






As you learned in Chapter 8, the execve function loads and runs the program
contained in the executable object file a.out within the current process, effectively
replacing the current program with the a.out program. Loading and running
a.out requires the following steps:

1. Delete existing user areas. Delete the existing area structs in the user portion
of the current process's virtual address space.

2. Map private areas. Create new area structs for the code, data, bss, and stack
areas of the new program. All of these new areas are private copy-on-write.
The code and data areas are mapped to the .text and .data sections of the
a.out file. The bss area is demand-zero, mapped to an anonymous file whose
size is contained in a.out. The stack and heap area are also demand-zero,
initially of zero length. Figure 9.31 summarizes the different mappings of the
private areas.

3. Map shared areas. If the a.out program was linked with shared objects, such
as the standard C library libc.so, then these objects are dynamically linked
into the program, and then mapped into the shared region of the user's virtual
address space.

4. Set the program counter (PC). The last thing that execve does is to set the
program counter in the current process's context to point to the entry point
in the code area.

The next time this process is scheduled, it will begin execution from the entry
point. Linux will swap in code and data pages as needed.

Figure 9.31 How the loader maps the areas of the user address space.



9.8.4 User-Level Memory Mapping with the mmap Function


Linux processes can use the mmap function to create new areas of virtual memory
and to map objects into these areas.

Figure 9.32 Visual interpretation of mmap arguments. (The figure shows the disk
file specified by file descriptor fd and the process virtual memory.)



#include <unistd.h> 
#include <sys/mman.h> 


void *mmap(void *start, size_t length, int prot, int flags, 
int fd, off_t offset); 
Returns: pointer to mapped area if OK, MAP_FAILED (-1) on error


The mmap function asks the kernel to create a new virtual memory area, preferably 
one that starts at address start, and to map a contiguous chunk of the object 
specified by file descriptor fd to the new area. The contiguous object chunk has a 
size of length bytes and starts at an offset of offset bytes from the beginning of 
the file. The start address is merely a hint, and is usually specified as NULL. For 
our purposes, we will always assume a NULL start address. Figure 9.32 depicts the 
meaning of these arguments.

The prot argument contains bits that describe the access permissions of the
newly mapped virtual memory area (i.e., the vm_prot bits in the corresponding
area struct).


PROT_EXEC. Pages in the area consist of instructions that may be executed
by the CPU.

PROT_READ. Pages in the area may be read.

PROT_WRITE. Pages in the area may be written.

PROT_NONE. Pages in the area cannot be accessed.


The flags argument consists of bits that describe the type of the mapped
object. If the MAP_ANON flag bit is set, then the backing store is an anonymous
object and the corresponding virtual pages are demand-zero. MAP_PRIVATE
indicates a private copy-on-write object, and MAP_SHARED indicates a shared
object. For example,


bufp = Mmap(NULL, size, PROT_READ, MAP_PRIVATE|MAP_ANON, 0, 0);

asks the kernel to create a new read-only, private, demand-zero area of virtual
memory containing size bytes. If the call is successful, then bufp contains the
address of the new area.

The munmap function deletes regions of virtual memory: 


#include <unistd.h> 
#include <sys/mman.h> 


int munmap(void *start, size_t length); 
Returns: 0 if OK, -1 on error





The munmap function deletes the area starting at virtual address start and
consisting of the next length bytes. Subsequent references to the deleted region result
in segmentation faults.





Write a C program mmapcopy.c that uses mmap to copy an arbitrary-size disk file to
stdout. The name of the input file should be passed as a command-line argument.






9.9 Dynamic Memory Allocation 


While it is certainly possible to use the low-level mmap and munmap functions to
create and delete areas of virtual memory, C programmers typically find it more
convenient and more portable to use a dynamic memory allocator when they need
to acquire additional virtual memory at run time.

A dynamic memory allocator maintains an area of a process's virtual memory
known as the heap (Figure 9.33). Details vary from system to system, but without
loss of generality, we will assume that the heap is an area of demand-zero memory
that begins immediately after the uninitialized data area and grows upward
(toward higher addresses). For each process, the kernel maintains a variable brk
(pronounced "break") that points to the top of the heap.

An allocator maintains the heap as a collection of various-size blocks. Each
block is a contiguous chunk of virtual memory that is either allocated or free. An
allocated block has been explicitly reserved for use by the application. A free block
is available to be allocated. A free block remains free until it is explicitly allocated
by the application. An allocated block remains allocated until it is freed, either
explicitly by the application or implicitly by the memory allocator itself.

Allocators come in two basic styles. Both styles require the application to 
explicitly allocate blocks. They differ about which entity is responsible for freeing 
allocated blocks. 


Figure 9.33 The heap. [Memory layout, from high to low addresses: user stack;
memory-mapped region for shared libraries; heap, whose top is marked by the
brk pointer; uninitialized data (.bss); initialized data (.data); code (.text).]

* Explicit allocators require the application to explicitly free any allocated
blocks. For example, the C standard library provides an explicit allocator
called the malloc package. C programs allocate a block by calling the malloc
function, and free a block by calling the free function. The new and delete
calls in C++ are comparable.


* Implicit allocators, on the other hand, require the allocator to detect when
an allocated block is no longer being used by the program and then free 
the block. Implicit allocators are also known as garbage collectors, and the 
process of automatically freeing unused allocated blocks is known as garbage 
collection. For example, higher-level languages such as Lisp, ML, and Java rely 
on garbage collection to free allocated blocks. 


The remainder of this section discusses the design and implementation of 
explicit allocators. We will discuss implicit allocators in Section 9.10. For concrete- 
ness, our discussion focuses on allocators that manage heap memory. However, 
you should be aware that memory allocation is a general idea that arises in a vari- 
ety of contexts. For example, applications that do intensive manipulation of graphs 
will often use the standard allocator to acquire a large block of virtual memory
and then use an application-specific allocator to manage the memory within that 
block as the nodes of the graph are created and destroyed. 


9.9.1 The malloc and free Functions

The C standard library provides an explicit allocator known as the malloc package. 

Programs allocate blocks from the heap by calling the malloc function. 
#include <stdlib.h> 


void *malloc(size_t size);
Returns: pointer to allocated block if OK, NULL on error

The malloc function returns a pointer to a block of memory of at least size bytes
that is suitably aligned for any kind of data object that might be contained in the
block. In practice, the alignment depends on whether the code is compiled to run
in 32-bit mode (gcc -m32) or 64-bit mode (the default). In 32-bit mode, malloc
returns a block whose address is always a multiple of 8. In 64-bit mode, the address
is always a multiple of 16.

If malloc encounters a problem (e.g., the program requests a block of memory
that is larger than the available virtual memory), then it returns NULL and sets
errno. Malloc does not initialize the memory it returns. Applications that want
initialized dynamic memory can use calloc, a thin wrapper around the malloc
function that initializes the allocated memory to zero. Applications that want to
change the size of a previously allocated block can use the realloc function.
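As a small illustration of the calls just described, the following sketch resizes an array with realloc and zeroes the newly added tail by hand, since realloc (like malloc) leaves new memory uninitialized. The helper name grow_zeroed is hypothetical and is not part of the malloc package.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: resize an int array from old_n to new_n elements.
 * Returns the (possibly moved) array, or NULL on failure. On failure the
 * original block is untouched and still owned by the caller. */
int *grow_zeroed(int *arr, size_t old_n, size_t new_n)
{
    if (new_n == 0 || new_n > SIZE_MAX / sizeof(int))
        return NULL;                    /* reject empty or overflowing sizes */

    int *tmp = realloc(arr, new_n * sizeof(int));
    if (tmp == NULL)
        return NULL;                    /* arr is still valid here */

    if (new_n > old_n)                  /* realloc does not zero the new tail */
        memset(tmp + old_n, 0, (new_n - old_n) * sizeof(int));
    return tmp;
}
```

Assigning the result of realloc to a temporary, rather than directly to arr, avoids losing the only pointer to the original block when the call fails.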

Dynamic memory allocators such as malloc can allocate or deallocate heap 
memory explicitly by using the mmap and munmap functions, or they can use the
sbrk function: 


#include <unistd.h> 


void *sbrk(intptr_t incr); 
Returns: old brk pointer on success, -1 on error


The sbrk function grows or shrinks the heap by adding incr to the kernel's brk
pointer. If successful, it returns the old value of brk, otherwise it returns -1 and
sets errno to ENOMEM. If incr is zero, then sbrk returns the current value of
brk. Calling sbrk with a negative incr is legal but tricky because the return value
(the old value of brk) points to abs(incr) bytes past the new top of the heap.

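The return-value convention is easier to see in a small model. The sketch below imitates the sbrk semantics described above over a static array instead of the real heap; the names sim_heap, sim_brk, sim_sbrk, and the capacity SIM_HEAP_MAX are assumptions made for this illustration only.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_HEAP_MAX 4096            /* hypothetical capacity of the model heap */

static char sim_heap[SIM_HEAP_MAX];  /* backing store standing in for the heap */
static char *sim_brk = sim_heap;     /* model of the kernel's brk pointer */

/* Model of sbrk: add incr to the break and return its old value, or
 * return (void *)-1 with errno set to ENOMEM if the move is impossible. */
void *sim_sbrk(intptr_t incr)
{
    char *old_brk = sim_brk;
    ptrdiff_t used = sim_brk - sim_heap;

    if (incr > (intptr_t)(SIM_HEAP_MAX - used) || incr < -(intptr_t)used) {
        errno = ENOMEM;
        return (void *)-1;
    }
    sim_brk += incr;
    return old_brk;                  /* old break, even when incr is negative */
}
```

After a call such as sim_sbrk(-16), the returned pointer lies 16 bytes above the new break, which is the "tricky" case mentioned in the text.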
Programs free allocated heap blocks by calling the free function. 


#include <stdlib.h>


void free(void *ptr); 
Returns: nothing 


The ptr argument must point to the beginning of an allocated block that was 
obtained from malloc, calloc, or realloc. If not, then the behavior of free 
is undefined. Even worse, since it returns nothing, free gives no indication to 
the application that something is wrong. As we shall see in Section 9.11, this can 
produce some baffling run-time errors.

Figure 9.34 Allocating and freeing blocks with malloc and free. [Panels (a)
through (e) show the heap after each request described below; panel (e) is
p4 = malloc(2*sizeof(int)).]


Figure 9.34 shows how an implementation of malloc and free might manage
a (very) small heap of 16 words for a C program. Each box represents a 4-byte
word. The heavy-lined rectangles correspond to allocated blocks (shaded) and
free blocks (unshaded). Initially, the heap consists of a single 16-word
double-word-aligned free block.¹


Figure 9.34(a). The program asks for a four-word block. Malloc responds by
carving out a four-word block from the front of the free block and returning
a pointer to the first word of the block.


Figure 9.34(b). The program requests a five-word block. Malloc responds by
allocating a six-word block from the front of the free block. In this example,
malloc pads the block with an extra word in order to keep the free
block aligned on a double-word boundary.


Figure 9.34(c). The program requests a six-word block and malloc responds by
carving out a six-word block from the free block.


Figure 9.34(d). The program frees the six-word block that was allocated in
Figure 9.34(b). Notice that after the call to free returns, the pointer p2
still points to the freed block. It is the responsibility of the application not
to use p2 again until it is reinitialized by a new call to malloc.


Figure 9.34(e). The program requests a two-word block. In this case, malloc
allocates a portion of the block that was freed in the previous step and
returns a pointer to this new block.


1. Throughout this section, we will assume that the allocator returns blocks aligned to 8-byte
double-word boundaries.


9.9.2 Why Dynamic Memory Allocation? 


The most important reason that programs use dynamic memory allocation is that
often they do not know the sizes of certain data structures until the program
actually runs. For example, suppose we are asked to write a C program that reads
a list of n ASCII integers, one integer per line, from stdin into a C array. The
input consists of the integer n, followed by the n integers to be read and stored
into the array. The simplest approach is to define the array statically with some
hard-coded maximum array size:

#include "csapp.h"
#define MAXN 15213

int array[MAXN];

int main()
{
    int i, n;

    scanf("%d", &n);            /* number of integers that follow */
    if (n > MAXN)
        app_error("Input file too big");
    for (i = 0; i < n; i++)
        scanf("%d", &array[i]);
    exit(0);
}


Allocating arrays with hard-coded sizes like this is often a bad idea. The value
of MAXN is arbitrary and has no relation to the actual amount of available virtual
memory on the machine. Further, if the user of this program wanted to read a file
that was larger than MAXN, the only recourse would be to recompile the program
with a larger value of MAXN. While not a problem for this simple example, the
presence of hard-coded array bounds can become a maintenance nightmare for
large software products with millions of lines of code and numerous users.

A better approach is to allocate the array dynamically, at run time, after the
value of n becomes known. With this approach, the maximum size of the array is
limited only by the amount of available virtual memory.

#include "csapp.h"

int main()
{
    int *array, i, n;

    scanf("%d", &n);            /* number of integers that follow */
    array = (int *)Malloc(n * sizeof(int));
    for (i = 0; i < n; i++)
        scanf("%d", &array[i]);
    free(array);
    exit(0);
}


Dynamic memory allocation is a useful and important programming tech- 
nique. However, in order to use allocators correctly and efficiently, programmers 
need to have an understanding of how they work. We will discuss some of the grue- 
some errors that can result from the improper use of allocators in Section 9.11. 


9.9.3 Allocator Requirements and Goals 


Explicit allocators must operate within some rather stringent constraints: 


Handling arbitrary request sequences. An application can make an arbitrary se- 
quence of allocate and free requests, subject to the constraint that each 
free request must correspond to a currently allocated block obtained from 
a previous allocate request. Thus, the allocator cannot make any assump- 
tions about the ordering of allocate and free requests. For example, the 
allocator cannot assume that all allocate requests are accompanied by a 
matching free request, or that matching allocate and free requests are 
nested. 

Making immediate responses to requests. The allocator must respond immediately
to allocate requests. Thus, the allocator is not allowed to reorder or
buffer requests in order to improve performance.


Using only the heap. In order for the allocator to be scalable, any nonscalar data 
structures used by the allocator must be stored in the heap itself. 


Aligning blocks (alignment requirement). The allocator must align blocks in 
such a way that they can hold any type of data object. 


Not modifying allocated blocks. Allocators can only manipulate or change free 
blocks. In particular, they are not allowed to modify or move blocks 
once they are allocated. Thus, techniques such as compaction of allocated 
blocks are not permitted.

Working within these constraints, the author of an allocator attempts to meet
the often conflicting performance goals of maximizing throughput and memory
utilization.


Goal 1: Maximizing throughput. Given some sequence of n allocate and free
requests

    $R_0, R_1, \ldots, R_k, \ldots, R_{n-1}$

we would like to maximize an allocator's throughput, which is defined as the
number of requests that it completes per unit time. For example, if an allocator
completes 500 allocate requests and 500 free requests in 1 second, then its
throughput is 1,000 operations per second. In general, we can maximize throughput
by minimizing the average time to satisfy allocate and free requests. As we'll
see, it is not too difficult to develop allocators with reasonably good performance
where the worst-case running time of an allocate request is linear in the number
of free blocks and the running time of a free request is constant.


Goal 2: Maximizing memory utilization. Naive programmers often incorrectly
assume that virtual memory is an unlimited resource. In fact, the total amount
of virtual memory allocated by all of the processes in a system is limited by the
amount of swap space on disk. Good programmers know that virtual memory is
a finite resource that must be used efficiently. This is especially true for a dynamic
memory allocator that might be asked to allocate and free large blocks of memory.

There are a number of ways to characterize how efficiently an allocator uses
the heap. In our experience, the most useful metric is peak utilization. As before,
we are given some sequence of n allocate and free requests

    $R_0, R_1, \ldots, R_k, \ldots, R_{n-1}$

If an application requests a block of p bytes, then the resulting allocated block has
a payload of p bytes. After request $R_k$ has completed, let the aggregate payload,
denoted $P_k$, be the sum of the payloads of the currently allocated blocks, and let
$H_k$ denote the current (monotonically nondecreasing) size of the heap.

Then the peak utilization over the first k + 1 requests, denoted by $U_k$, is
given by

    $U_k = \frac{\max_{i \le k} P_i}{H_k}$

The objective of the allocator, then, is to maximize the peak utilization $U_{n-1}$
over the entire sequence. As we will see, there is a tension between maximizing
throughput and utilization. In particular, it is easy to write an allocator that
maximizes throughput at the expense of heap utilization. One of the interesting
challenges in any allocator design is finding an appropriate balance between the
two goals.

Aside Relaxing the monotonicity assumption

We could relax the monotonically nondecreasing assumption in our definition of $U_k$ and allow the heap
to grow up and down by letting $H_k$ be the high-water mark over the first k + 1 requests.
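The peak utilization metric can be computed directly from a recorded trace. The sketch below assumes the trace is given as two arrays, payload[i] holding the aggregate payload after request i and heap[i] holding the heap size after request i. The function name and the array representation are illustrative choices, not part of any allocator interface.

```c
#include <stddef.h>

/* Peak utilization over requests 0..k: the largest aggregate payload seen
 * so far divided by the current heap size. Returns 0.0 for an empty heap. */
double peak_utilization(const size_t *payload, const size_t *heap, size_t k)
{
    size_t max_payload = 0;

    for (size_t i = 0; i <= k; i++)
        if (payload[i] > max_payload)
            max_payload = payload[i];

    return heap[k] == 0 ? 0.0 : (double)max_payload / (double)heap[k];
}
```

For a made-up trace with aggregate payloads 16, 40, 24 and heap sizes 32, 64, 64, the peak utilization after the third request is 40/64, because the maximum payload is remembered even after blocks are freed.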


9.9.4 Fragmentation 


The primary cause of poor heap utilization is a phenomenon known as fragmen- 
tation, which occurs when otherwise unused memory is not available to satisfy 
allocate requests. There are two forms of fragmentation: internal fragmentation 
and external fragmentation. 

Internal fragmentation occurs when an allocated block is larger than the pay- 
load. This might happen for a number of reasons. For example, the implementation 
of an allocator might impose a minimum size on allocated blocks that is greater 
than some requested payload. Or, as we saw in Figure 9.34(b), the allocator might 
increase the block size in order to satisfy alignment constraints. 

Internal fragmentation is straightforward to quantify. It is simply the sum of 
the differences between the sizes of the allocated blocks and their payloads. Thus, 
at any point in time, the amount of internal fragmentation depends only on the 
pattern of previous requests and the allocator implementation. 
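The definition above translates into a one-line sum. In this sketch, struct alloc_record is a hypothetical bookkeeping record that pairs each allocated block's size with the payload the application requested; a real allocator would not normally keep such a table.

```c
#include <stddef.h>

/* Hypothetical record of one allocated block. */
struct alloc_record {
    size_t block_size;   /* bytes the allocator reserved for the block */
    size_t payload;      /* bytes the application asked for */
};

/* Internal fragmentation: sum of (block size - payload) over all
 * currently allocated blocks. Assumes block_size >= payload. */
size_t internal_fragmentation(const struct alloc_record *blocks, size_t n)
{
    size_t wasted = 0;

    for (size_t i = 0; i < n; i++)
        wasted += blocks[i].block_size - blocks[i].payload;
    return wasted;
}
```

Applied to the first three requests of Figure 9.34 with 4-byte words (blocks of 16, 24, and 24 bytes holding payloads of 16, 20, and 24 bytes), the result is 4 bytes, the single padding word added in step (b).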

External fragmentation occurs when there is enough aggregate free memory 
to satisfy an allocate request, but no single free block is large enough to handle 
the request. For example, if the request in Figure 9.34(e) were for eight words 
rather than two words, then the request could not be satisfied without requesting 
additional virtual memory from the kernel, even though there are eight free words 
remaining in the heap. The problem arises because these eight words are spread 
over two free blocks. 

External fragmentation is much more difficult to quantify than internal frag- 
mentation because it depends not only on the pattern of previous requests and the 
allocator implementation but also on the pattern of future requests. For example, 
suppose that after k requests all of the free blocks are exactly four words in size. 
Does this heap suffer from external fragmentation? The answer depends on the 
pattern of future requests. If all of the future allocate requests are for blocks that 
are smaller than or equal to four words, then there is no external fragmentation. 
On the other hand, if one or more requests ask for blocks larger than four words, 
then the heap does suffer from external fragmentation. 
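Whether a particular request is blocked by external fragmentation can be stated as a simple predicate: there is enough free memory in total, but no single free block is large enough. The sketch below assumes the free block sizes are available as an array and ignores header overhead; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the request cannot be satisfied even though the aggregate
 * free memory would be sufficient, i.e., the free memory is too scattered. */
bool blocked_by_external_fragmentation(const size_t *free_sizes, size_t n,
                                       size_t request)
{
    size_t total = 0, largest = 0;

    for (size_t i = 0; i < n; i++) {
        total += free_sizes[i];
        if (free_sizes[i] > largest)
            largest = free_sizes[i];
    }
    return total >= request && largest < request;
}
```

With two free blocks of four words each, a request for eight words is blocked by external fragmentation, a request for four words is not, and a request for twelve words fails simply because there is not enough free memory at all.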

Since external fragmentation is difficult to quantify and impossible to predict, 
allocators typically employ heuristics that attempt to maintain small numbers of 
larger free blocks rather than large numbers of smaller free blocks. 


9.9.5 Implementation Issues 


The simplest imaginable allocator would organize the heap as a large array of 
bytes and a pointer p that initially points to the first byte of the array. To allocate
size bytes, malloc would save the current value of p on the stack, increment p by
size, and return the old value of p to the caller. Free would simply return to the
caller without doing anything.
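A minimal sketch of this naive allocator follows. The names naive_heap, naive_p, naive_malloc, and naive_free, and the 1,024-byte capacity, are assumptions made for the illustration. The bounds check is also an addition so that the sketch fails cleanly, and no attempt is made to align the returned pointers.

```c
#include <stddef.h>

#define NAIVE_HEAP_SIZE 1024              /* hypothetical capacity in bytes */

static char naive_heap[NAIVE_HEAP_SIZE];  /* the heap: a large array of bytes */
static char *naive_p = naive_heap;        /* points to the first unused byte */

/* Allocate by bumping the pointer; returns NULL when the array is exhausted. */
void *naive_malloc(size_t size)
{
    size_t avail = (size_t)(naive_heap + NAIVE_HEAP_SIZE - naive_p);

    if (size > avail)
        return NULL;

    char *old_p = naive_p;                /* save the current value of p */
    naive_p += size;                      /* increment p by size */
    return old_p;                         /* return the old value of p */
}

/* Freeing does nothing, so no block is ever reused. */
void naive_free(void *ptr)
{
    (void)ptr;
}
```

Because naive_free never returns memory, a block allocated after a free is always placed past every earlier block, which is exactly the poor utilization the text describes.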

This naive allocator is an extreme point in the design space. Since each malloc 
and free execute only a handful of instructions, throughput would be extremely 
good. However, since the allocator never reuses any blocks, memory utilization 
would be extremely bad. A practical allocator that strikes a better balance between 
throughput and utilization must consider the following issues: 


Free block organization. How do we keep track of free blocks? 


Placement. How do we choose an appropriate free block in which to place a
newly allocated block?


Splitting. After we place a newly allocated block in some free block, what do
we do with the remainder of the free block?


Coalescing. What do we do with a block that has just been freed? 


The rest of this section looks at these issues in more detail. Since the basic 
techniques of placement, splitting, and coalescing cut across many different free 
block organizations, we will introduce them in the context of a simple free block 
organization known as an implicit free list. 


9.9.6 Implicit Free Lists 


Any practical allocator needs some data structure that allows it to distinguish
block boundaries and to distinguish between allocated and free blocks. Most
allocators embed this information in the blocks themselves. One simple approach
is shown in Figure 9.35.

Figure 9.35 Format of a simple heap block. [One-word header: the block size
occupies bits 31-3 and the low-order bit a is the allocated bit (a = 1: allocated,
a = 0: free). The header is followed by the payload (allocated block only) and
optional padding. malloc returns a pointer to the beginning of the payload. The
block size includes the header, payload, and any padding.]

In this case, a block consists of a one-word header, the payload, and possibly
some additional padding. The header encodes the block size (including the header
and any padding) as well as whether the block is allocated or free. If we impose a
double-word alignment constraint, then the block size is always a multiple of 8 and
the 3 low-order bits of the block size are always zero. Thus, we need to store only
the 29 high-order bits of the block size, freeing the remaining 3 bits to encode
other information. In this case, we are using the least significant of these bits
(the allocated bit) to indicate whether the block is allocated or free. For example,
suppose we have an allocated block with a block size of 24 (0x18) bytes. Then its
header would be

    0x00000018 | 0x1 = 0x00000019

Similarly, a free block with a block size of 40 (0x28) bytes would have a header of

    0x00000028 | 0x0 = 0x00000028

Figure 9.36 Organizing the heap with an implicit free list. Allocated blocks are shaded. Free blocks are
unshaded. Headers are labeled with (size (bytes)/allocated bit).


The header is followed by the payload that the application requested when it 
called malloc. The payload is followed by a chunk of unused padding that can be 
any size. There are a number of reasons for the padding. For example, the padding 
might be part of an allocator’s strategy for combating external fragmentation. Or 
it might be needed to satisfy the alignment requirement. 
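The header encoding described above amounts to a few bit operations. The following sketch uses hypothetical helper names (pack_header, header_size, header_alloc) and assumes a 32-bit header whose three low-order bits are reserved for flags.

```c
#include <stdint.h>

/* Combine a block size (a multiple of 8) and an allocated bit into a header. */
static inline uint32_t pack_header(uint32_t size, uint32_t alloc)
{
    return size | alloc;
}

/* Recover the block size by masking off the three low-order bits. */
static inline uint32_t header_size(uint32_t header)
{
    return header & ~0x7u;
}

/* Recover the allocated bit (1 = allocated, 0 = free). */
static inline uint32_t header_alloc(uint32_t header)
{
    return header & 0x1u;
}
```

With these helpers, pack_header(24, 1) yields 0x19 and pack_header(40, 0) yields 0x28, matching the two worked headers in the text.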

Given the block format in Figure 9.35, we can organize the heap as a sequence 
of contiguous allocated and free blocks, as shown in Figure 9.36. 

We call this organization an implicit free list because the free blocks are linked
implicitly by the size fields in the headers. The allocator can indirectly traverse
the entire set of free blocks by traversing all of the blocks in the heap. Notice that
we need some kind of specially marked end block: in this example, a terminating
header with the allocated bit set and a size of zero. (As we will see in Section 9.9.12,
setting the allocated bit simplifies the coalescing of free blocks.)

The advantage of an implicit free list is simplicity. A significant disadvantage is
that the cost of any operation that requires a search of the free list, such as placing
allocated blocks, will be linear in the total number of allocated and free blocks in
the heap.
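To make the implicit linkage concrete, the sketch below walks a heap laid out as an array of 4-byte words, hopping from header to header by the size stored in each one until it reaches the zero-size terminating header. The function name free_bytes and the array representation are assumptions for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define WSIZE 4   /* size of a header word in bytes */

/* Walk the implicit free list starting at the first header and return the
 * total number of bytes held in free blocks. The walk visits every block,
 * allocated or free, and stops at the header whose size field is zero. */
size_t free_bytes(const uint32_t *heap)
{
    size_t total = 0;
    const uint32_t *hdr = heap;

    while ((*hdr & ~0x7u) != 0) {
        uint32_t size = *hdr & ~0x7u;   /* block size in bytes */

        if ((*hdr & 0x1u) == 0)         /* allocated bit clear: free block */
            total += size;
        hdr += size / WSIZE;            /* header of the next block */
    }
    return total;
}
```

The loop touches every block in the heap, which is why searches on an implicit free list cost time linear in the total number of blocks rather than in the number of free blocks.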

It is important to realize that the system's alignment requirement and the
allocator's choice of block format impose a minimum block size on the allocator.
No allocated or free block may be smaller than this minimum. For example, if
we assume a double-word alignment requirement, then the size of each block
must be a multiple of two words (8 bytes). Thus, the block format in Figure 9.35
induces a minimum block size of two words: one word for the header and another
to maintain the alignment requirement. Even if the application were to request a
single byte, the allocator would still create a two-word block.

Determine the block sizes and header values that would result from the following
sequence of malloc requests. Assumptions: (1) The allocator maintains
double-word alignment and uses an implicit free list with the block format from
Figure 9.35. (2) Block sizes are rounded up to the nearest multiple of 8 bytes.


Request       Block size (decimal bytes)   Block header (hex)

malloc(1)     ________                     ________
malloc(5)     ________                     ________
malloc(12)    ________                     ________
malloc(13)    ________                     ________


9.9.7 Placing Allocated Blocks 


When an application requests a block of k bytes, the allocator searches the free
list for a free block that is large enough to hold the requested block. The manner
in which the allocator performs this search is determined by the placement policy.
Some common policies are first fit, next fit, and best fit.

First fit searches the free list from the beginning and chooses the first free
block that fits. Next fit is similar to first fit, but instead of starting each search at
the beginning of the list, it starts each search where the previous search left off.
Best fit examines every free block and chooses the free block with the smallest size
that fits.

An advantage of first fit is that it tends to retain large free blocks at the end
of the list. A disadvantage is that it tends to leave "splinters" of small free blocks
toward the beginning of the list, which will increase the search time for larger
blocks. Next fit was first proposed by Donald Knuth as an alternative to first fit,
motivated by the idea that if we found a fit in some free block the last time, there
is a good chance that we will find a fit the next time in the remainder of the block.
Next fit can run significantly faster than first fit, especially if the front of the list
becomes littered with many small splinters. However, some studies suggest that
next fit suffers from worse memory utilization than first fit. Studies have found
that best fit generally enjoys better memory utilization than either first fit or next
fit. However, the disadvantage of using best fit with simple free list organizations
such as the implicit free list is that it requires an exhaustive search of the heap.
Later, we will look at more sophisticated segregated free list organizations that
approximate a best-fit policy without an exhaustive search of the heap.
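The difference between first fit and best fit is easiest to see side by side. The sketch below searches an implicit free list stored as an array of 4-byte words with one-word headers and a zero-size terminating header; the function and helper names are hypothetical, and asize is assumed to be the already-adjusted block size in bytes, including the header.

```c
#include <stddef.h>
#include <stdint.h>

#define WSIZE 4   /* size of a header word in bytes */

static uint32_t blk_size(const uint32_t *hdr) { return *hdr & ~0x7u; }
static int      blk_free(const uint32_t *hdr) { return (*hdr & 0x1u) == 0; }

/* First fit: return the first free block of at least asize bytes, or NULL. */
uint32_t *first_fit(uint32_t *heap, uint32_t asize)
{
    for (uint32_t *hdr = heap; blk_size(hdr) != 0; hdr += blk_size(hdr) / WSIZE)
        if (blk_free(hdr) && blk_size(hdr) >= asize)
            return hdr;
    return NULL;
}

/* Best fit: scan every block and return the smallest free block that fits. */
uint32_t *best_fit(uint32_t *heap, uint32_t asize)
{
    uint32_t *best = NULL;

    for (uint32_t *hdr = heap; blk_size(hdr) != 0; hdr += blk_size(hdr) / WSIZE)
        if (blk_free(hdr) && blk_size(hdr) >= asize &&
            (best == NULL || blk_size(hdr) < blk_size(best)))
            best = hdr;
    return best;
}
```

First fit can stop at the first match, while best fit must always visit every block, which is the exhaustive search the text warns about. Next fit would look like first_fit with the starting header remembered between calls.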




9.9.8 Splitting Free Blocks


Once the allocator has located a free block that fits, it must make another policy
decision about how much of the free block to allocate. One option is to use
the entire free block. Although simple and fast, the main disadvantage is that it
introduces internal fragmentation. If the placement policy tends to produce good
fits, then some additional internal fragmentation might be acceptable.

However, if the fit is not good, then the allocator will usually opt to split
the free block into two parts. The first part becomes the allocated block, and the
remainder becomes a new free block. Figure 9.37 shows how the allocator might
split the eight-word free block in Figure 9.36 to satisfy an application's request for
three words of heap memory.

Figure 9.37 Splitting a free block to satisfy a three-word allocation request. Allocated blocks are shaded.
Free blocks are unshaded. Headers are labeled with (size (bytes)/allocated bit).
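A split can be sketched in a few lines for the header-only block format of Figure 9.35. The name place_block and the constant MIN_BLOCK are assumptions for this illustration; the caller is assumed to pass a free block of at least asize bytes, with asize already rounded up to a multiple of 8.

```c
#include <stdint.h>

#define WSIZE     4   /* size of a header word in bytes */
#define MIN_BLOCK 8   /* header word plus one word to preserve alignment */

/* Place an asize-byte allocated block at the start of the free block whose
 * header is hdr. Split only if the remainder is at least MIN_BLOCK bytes;
 * otherwise hand out the whole free block. */
void place_block(uint32_t *hdr, uint32_t asize)
{
    uint32_t csize = *hdr & ~0x7u;              /* size of the free block */

    if (csize - asize >= MIN_BLOCK) {
        *hdr = asize | 0x1u;                    /* allocated front part */
        hdr[asize / WSIZE] = (csize - asize);   /* new free remainder */
    } else {
        *hdr = csize | 0x1u;                    /* use the entire block */
    }
}
```

For the eight-word (32-byte) free block in the text, a three-word request becomes a 16-byte allocated block (header plus three payload words), leaving a 16-byte free block behind it.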





9.9.9 Getting Additional Heap Memory 





What happens if the allocator is unable to find a fit for the requested block? One
option is to try to create some larger free blocks by merging (coalescing) free
blocks that are physically adjacent in memory (next section). However, if this
does not yield a sufficiently large block, or if the free blocks are already maximally
coalesced, then the allocator asks the kernel for additional heap memory by calling
the sbrk function. The allocator transforms the additional memory into one large
free block, inserts the block into the free list, and then places the requested block
in this new free block.


9.9.10 Coalescing Free Blocks 


When the allocator frees an allocated block, there might be other free blocks
that are adjacent to the newly freed block. Such adjacent free blocks can cause
a phenomenon known as false fragmentation, where there is a lot of available free
memory chopped up into small, unusable free blocks. For example, Figure 9.38
shows the result of freeing the block that was allocated in Figure 9.37. The result
is two adjacent free blocks with payloads of three words each. As a result, a
subsequent request for a payload of four words would fail, even though the
aggregate size of the two free blocks is large enough to satisfy the request.

To combat false fragmentation, any practical allocator must merge adjacent
free blocks in a process known as coalescing. This raises an important policy
decision about when to perform coalescing. The allocator can opt for immediate
coalescing by merging any adjacent blocks each time a block is freed. Or it can opt
for deferred coalescing by waiting to coalesce free blocks at some later time. For
example, the allocator might defer coalescing until some allocation request fails,
and then scan the entire heap, coalescing all free blocks.


Figure 9.38 An example of false fragmentation. Allocated blocks are shaded. Free blocks are unshaded.
Headers are labeled with (size (bytes)/allocated bit).


Immediate coalescing is straightforward and can be performed in constant
time, but with some request patterns it can introduce a form of thrashing where a
block is repeatedly coalesced and then split soon thereafter. For example, in
Figure 9.38, a repeated pattern of allocating and freeing a three-word block would
introduce a lot of unnecessary splitting and coalescing. In our discussion of
allocators, we will assume immediate coalescing, but you should be aware that fast
allocators often opt for some form of deferred coalescing.


9.9.11 Coalescing with Boundary Tags


How does an allocator implement coalescing? Let us refer to the block we want
to free as the current block. Then coalescing the next free block (in memory) is
straightforward and efficient. The header of the current block points to the header
of the next block, which can be checked to determine if the next block is free. If
so, its size is simply added to the size of the current header and the blocks are
coalesced in constant time.

But how would we coalesce the previous block? Given an implicit free list of
blocks with headers, the only option would be to search the entire list, remembering
the location of the previous block, until we reached the current block. With an
implicit free list, this means that each call to free would require time linear in the
size of the heap. Even with more sophisticated free list organizations, the search
time would not be constant.

Knuth developed a clever and general technique, known as boundary tags,
that allows for constant-time coalescing of the previous block. The idea, which is
shown in Figure 9.39, is to add a footer (the boundary tag) at the end of each block,
where the footer is a replica of the header. If each block includes such a footer,
then the allocator can determine the starting location and status of the previous
block by inspecting its footer, which is always one word away from the start of the
current block.

Consider all the cases that can exist when the allocator frees the current block:


1. The previous and next blocks are both allocated. 

2. The previous block is allocated and the next block is free. 
3. The previous block is free and the next block is allocated. 
4. The previous and next blocks are both free. 





Figure 9.39 Format of heap block that uses a boundary tag. [From low to high
addresses: header (block size in bits 31-3, a/f in the low-order bits; a = 001:
allocated, a = 000: free); payload (allocated block only); padding (optional);
footer (block size, a/f).]

Figure 9.40 shows how we would coalesce each of the four cases.
In case 1, both adjacent blocks are allocated and thus no coalescing is possible.
So the status of the current block is simply changed from allocated to free. In case
2, the current block is merged with the next block. The header of the current block
and the footer of the next block are updated with the combined sizes of the current
and next blocks. In case 3, the previous block is merged with the current block.
The header of the previous block and the footer of the current block are updated
with the combined sizes of the two blocks. In case 4, all three blocks are merged
to form a single free block, with the header of the previous block and the footer of
the next block updated with the combined sizes of the three blocks. In each case,
the coalescing is performed in constant time.
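The four cases collapse into two independent checks, one for each neighbor. The sketch below works on a heap stored as an array of 4-byte words in which every block has a header and a matching footer. The names set_tags and free_and_coalesce are hypothetical, and the sketch assumes that an allocated sentinel word precedes the first block and follows the last one, so that both neighbors always exist.

```c
#include <stdint.h>

#define WSIZE 4   /* size of a header or footer word in bytes */

static uint32_t tag_size(uint32_t tag)  { return tag & ~0x7u; }
static uint32_t tag_alloc(uint32_t tag) { return tag & 0x1u; }

/* Write a matching header and footer for the block that starts at hdr. */
static void set_tags(uint32_t *hdr, uint32_t size, uint32_t alloc)
{
    hdr[0] = size | alloc;
    hdr[size / WSIZE - 1] = size | alloc;
}

/* Free the block at hdr, merge it with any free neighbors, and return the
 * header of the resulting free block. Runs in constant time. */
uint32_t *free_and_coalesce(uint32_t *hdr)
{
    uint32_t size = tag_size(hdr[0]);
    uint32_t *next = hdr + size / WSIZE;   /* header of the next block */
    uint32_t prev_footer = hdr[-1];        /* footer of the previous block */

    if (!tag_alloc(next[0]))               /* cases 2 and 4: absorb next */
        size += tag_size(next[0]);

    if (!tag_alloc(prev_footer)) {         /* cases 3 and 4: absorb previous */
        size += tag_size(prev_footer);
        hdr -= tag_size(prev_footer) / WSIZE;
    }

    set_tags(hdr, size, 0);                /* case 1 only flips the bit */
    return hdr;
}
```

Because the footer of the previous block sits exactly one word before the current header, no search is needed in any of the four cases.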
The idea of boundary tags is a simple and elegant one that generalizes to
many different types of allocators and free list organizations. However, there is
a potential disadvantage. Requiring each block to contain both a header and a
footer can introduce significant memory overhead if an application manipulates
many small blocks. For example, if a graph application dynamically creates and
destroys graph nodes by making repeated calls to malloc and free, and each graph
node requires only a couple of words of memory, then the header and the footer
will consume half of each allocated block.

Fortunately, there is a clever optimization of boundary tags that eliminates
the need for a footer in allocated blocks. Recall that when we attempt to coalesce
the current block with the previous and next blocks in memory, the size field in
the footer of the previous block is only needed if the previous block is free. If we
were to store the allocated/free bit of the previous block in one of the excess
low-order bits of the current block, then allocated blocks would not need footers, and
we could use that extra space for payload. Note, however, that free blocks would
still need footers.

Figure 9.40 Coalescing with boundary tags. Case 1: prev and next allocated.
Case 2: prev allocated, next free. Case 3: prev free, next allocated. Case 4: next and
prev free.

Practice Problem 9.7 (solution page 883)

Determine the minimum block size for each of the following combinations of
alignment requirements and block formats. Assumptions: Implicit free list, zero-size
payloads are not allowed, and headers and footers are stored in 4-byte words.

Alignment     Allocated block          Free block          Minimum block size (bytes)
Single word   Header and footer        Header and footer   ________
Single word   Header, but no footer    Header and footer   ________
Double word   Header and footer        Header and footer   ________
Double word   Header, but no footer    Header and footer   ________


9.9.12 Putting It Together: Implementing a Simple Allocator 


Building an allocator is a challenging task. The design space is large, with nu- 
merous alternatives for block format and free list format, as well as placement, 
splitting, and coalescing policies. Another challenge is that you are often forced 
to program outside the safe, familiar confines of the type system, relying on the 
error-prone pointer casting and pointer arithmetic that is typical of low-level sys- 
tems programming. 

While allocators do not require enormous amounts of code, they are subtle
and unforgiving. Students familiar with higher-level languages such as C++ or Java
often hit a conceptual wall when they first encounter this style of programming. To
help you clear this hurdle, we will work through the implementation of a simple
allocator based on an implicit free list with immediate boundary-tag coalescing.
The maximum block size is 2^32 = 4 GB. The code is 64-bit clean, running without
modification in 32-bit (gcc -m32) or 64-bit (gcc -m64) processes.


General Allocator Design 


Our allocator uses a model of the memory system provided by the memlib.c 
package shown in Figure 9.41. The purpose of the model is to allow us to run 
our allocator without interfering with the existing system-level malloc package. 

The mem_init function models the virtual memory available to the heap as a
large double-word aligned array of bytes. The bytes between mem_heap and
mem_brk represent allocated virtual memory. The bytes following mem_brk represent
unallocated virtual memory. The allocator requests additional heap memory by
calling the mem_sbrk function, which has the same interface as the system's sbrk
function, as well as the same semantics, except that it rejects requests to shrink
the heap.

The allocator itself is contained in a source file (mm.c) that users can compile
and link into their applications. The allocator exports three functions to
application programs:


    extern int mm_init(void);
    extern void *mm_malloc(size_t size);
    extern void mm_free(void *ptr);


The mm_init function initializes the allocator, returning 0 if successful and
-1 otherwise. The mm_malloc and mm_free functions have the same interfaces
and semantics as their system counterparts. The allocator uses the block format
code/vm/malloc/memlib.c

    /* Private global variables */
    static char *mem_heap;     /* Points to first byte of heap */
    static char *mem_brk;      /* Points to last byte of heap plus 1 */
    static char *mem_max_addr; /* Max legal heap addr plus 1 */

    /*
     * mem_init - Initialize the memory system model
     */
    void mem_init(void)
    {
        mem_heap = (char *)Malloc(MAX_HEAP);
        mem_brk = (char *)mem_heap;
        mem_max_addr = (char *)(mem_heap + MAX_HEAP);
    }

    /*
     * mem_sbrk - Simple model of the sbrk function. Extends the heap
     *    by incr bytes and returns the start address of the new area. In
     *    this model, the heap cannot be shrunk.
     */
    void *mem_sbrk(int incr)
    {
        char *old_brk = mem_brk;

        if ((incr < 0) || ((mem_brk + incr) > mem_max_addr)) {
            errno = ENOMEM;
            fprintf(stderr, "ERROR: mem_sbrk failed. Ran out of memory...\n");
            return (void *)-1;
        }
        mem_brk += incr;
        return (void *)old_brk;
    }

code/vm/malloc/memlib.c

Figure 9.41 memlib.c: Memory system model.


shown in Figure 9.39. The minimum block size is 16 bytes. The free list is organized
as an implicit free list, with the invariant form shown in Figure 9.42.

The first word is an unused padding word aligned to a double-word boundary.
The padding is followed by a special prologue block, which is an 8-byte allocated
block consisting of only a header and a footer. The prologue block is created
during initialization and is never freed. Following the prologue block are zero
or more regular blocks that are created by calls to malloc or free. The heap
always ends with a special epilogue block, which is a zero-size allocated block
that consists of only a header. The prologue and epilogue blocks are tricks that
eliminate the edge conditions during coalescing. The allocator uses a single private
(static) global variable (heap_listp) that always points to the prologue block.
(As a minor optimization, we could make it point to the next block instead of the
prologue block.)

Figure 9.42 Invariant form of the implicit free list.


Basic Constants and Macros for Manipulating the Free List 


Figure 9.43 shows some basic constants and macros that we will use throughout 
the allocator code. Lines 2-4 define some basic size constants: the sizes of words 
(WSIZE) and double words (DSIZE), and the size of the initial free block and 
the default size for expanding the heap (CHUNKSIZE). 

Manipulating the headers and footers in the free list can be troublesome 
because it demands extensive use of casting and pointer arithmetic. Thus, we find 
it helpful to define a small set of macros for accessing and traversing the free list 
(lines 9-25). The PACK macro (line 9) combines a size and an allocate bit and 
returns a value that can be stored in a header or footer. 

The GET macro (line 12) reads and returns the word referenced by
argument p. The casting here is crucial. The argument p is typically a (void *) pointer,
which cannot be dereferenced directly. Similarly, the PUT macro (line 13) stores
val in the word pointed at by argument p.

The GET_SIZE and GET_ALLOC macros (lines 16-17) return the size and
allocated bit, respectively, from a header or footer at address p. The remaining
macros operate on block pointers (denoted bp) that point to the first payload
byte. Given a block pointer bp, the HDRP and FTRP macros (lines 20-21) return
pointers to the block header and footer, respectively. The NEXT_BLKP and
PREV_BLKP macros (lines 24-25) return the block pointers of the next and
previous blocks, respectively.

The macros can be composed in various ways to manipulate the free list. For
example, given a pointer bp to the current block, we could use the following line
of code to determine the size of the next block in memory:

    size_t size = GET_SIZE(HDRP(NEXT_BLKP(bp)));
code/vm/malloc/mm.c

    1   /* Basic constants and macros */
    2   #define WSIZE       4       /* Word and header/footer size (bytes) */
    3   #define DSIZE       8       /* Double word size (bytes) */
    4   #define CHUNKSIZE  (1<<12)  /* Extend heap by this amount (bytes) */
    5
    6   #define MAX(x, y) ((x) > (y)? (x) : (y))
    7
    8   /* Pack a size and allocated bit into a word */
    9   #define PACK(size, alloc)  ((size) | (alloc))
    10
    11  /* Read and write a word at address p */
    12  #define GET(p)       (*(unsigned int *)(p))
    13  #define PUT(p, val)  (*(unsigned int *)(p) = (val))
    14
    15  /* Read the size and allocated fields from address p */
    16  #define GET_SIZE(p)  (GET(p) & ~0x7)
    17  #define GET_ALLOC(p) (GET(p) & 0x1)
    18
    19  /* Given block ptr bp, compute address of its header and footer */
    20  #define HDRP(bp)       ((char *)(bp) - WSIZE)
    21  #define FTRP(bp)       ((char *)(bp) + GET_SIZE(HDRP(bp)) - DSIZE)
    22
    23  /* Given block ptr bp, compute address of next and previous blocks */
    24  #define NEXT_BLKP(bp)  ((char *)(bp) + GET_SIZE(((char *)(bp) - WSIZE)))
    25  #define PREV_BLKP(bp)  ((char *)(bp) - GET_SIZE(((char *)(bp) - DSIZE)))

code/vm/malloc/mm.c

Figure 9.43 Basic constants and macros for manipulating the free list.
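
To see how these macros compose, the following self-contained sketch repeats the macro definitions and exercises them on a small scratch array instead of a real heap. The function demo_next_block_size and the array layout (one padding word, a 16-byte allocated block, then a 24-byte free block) are assumptions chosen for this illustration and are not part of mm.c.

```c
#include <assert.h>
#include <stddef.h>

#define WSIZE 4
#define DSIZE 8
#define PACK(size, alloc)  ((size) | (alloc))
#define GET(p)        (*(unsigned int *)(p))
#define PUT(p, val)   (*(unsigned int *)(p) = (val))
#define GET_SIZE(p)   (GET(p) & ~0x7)
#define GET_ALLOC(p)  (GET(p) & 0x1)
#define HDRP(bp)      ((char *)(bp) - WSIZE)
#define FTRP(bp)      ((char *)(bp) + GET_SIZE(HDRP(bp)) - DSIZE)
#define NEXT_BLKP(bp) ((char *)(bp) + GET_SIZE(((char *)(bp) - WSIZE)))
#define PREV_BLKP(bp) ((char *)(bp) - GET_SIZE(((char *)(bp) - DSIZE)))

/* Lay out two adjacent blocks in a scratch array and return the size
   of the block that follows the first one. */
static size_t demo_next_block_size(void)
{
    static unsigned int words[16];      /* word-aligned scratch "heap" */
    char *bp1 = (char *)words + 2 * WSIZE;  /* word 0 is padding, word 1 is the header */
    char *bp2;

    PUT(HDRP(bp1), PACK(16, 1));        /* 16-byte allocated block */
    PUT(FTRP(bp1), PACK(16, 1));

    bp2 = NEXT_BLKP(bp1);               /* payload of the following block */
    PUT(HDRP(bp2), PACK(24, 0));        /* 24-byte free block */
    PUT(FTRP(bp2), PACK(24, 0));

    assert(PREV_BLKP(bp2) == bp1);      /* the footer of block 1 leads back to it */
    assert(GET_ALLOC(HDRP(bp1)) == 1);
    assert(GET_ALLOC(HDRP(bp2)) == 0);

    return GET_SIZE(HDRP(NEXT_BLKP(bp1)));
}
```

Calling demo_next_block_size() returns 24, the size stored in the second block's header.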


Creating the Initial Free List


Before calling mm_malloc or mm_free, the application must initialize the heap by
calling the mm_init function (Figure 9.44).

The mm_init function gets four words from the memory system and initializes
them to create the empty free list (lines 4-10). It then calls the extend_heap
function (Figure 9.45), which extends the heap by CHUNKSIZE bytes and creates
the initial free block. At this point, the allocator is initialized and ready to accept
allocate and free requests from the application.

The extend_heap function is invoked in two different circumstances: (1) when
the heap is initialized and (2) when mm_malloc is unable to find a suitable fit. To
maintain alignment, extend_heap rounds up the requested size to the nearest
code/vm/malloc/mm.c

    1   int mm_init(void)
    2   {
    3       /* Create the initial empty heap */
    4       if ((heap_listp = mem_sbrk(4*WSIZE)) == (void *)-1)
    5           return -1;
    6       PUT(heap_listp, 0);                          /* Alignment padding */
    7       PUT(heap_listp + (1*WSIZE), PACK(DSIZE, 1)); /* Prologue header */
    8       PUT(heap_listp + (2*WSIZE), PACK(DSIZE, 1)); /* Prologue footer */
    9       PUT(heap_listp + (3*WSIZE), PACK(0, 1));     /* Epilogue header */
    10      heap_listp += (2*WSIZE);
    11
    12      /* Extend the empty heap with a free block of CHUNKSIZE bytes */
    13      if (extend_heap(CHUNKSIZE/WSIZE) == NULL)
    14          return -1;
    15      return 0;
    16  }

code/vm/malloc/mm.c

Figure 9.44 mm_init creates a heap with an initial free block.


code/vm/malloc/mm.c

    1   static void *extend_heap(size_t words)
    2   {
    3       char *bp;
    4       size_t size;
    5
    6       /* Allocate an even number of words to maintain alignment */
    7       size = (words % 2) ? (words+1) * WSIZE : words * WSIZE;
    8       if ((long)(bp = mem_sbrk(size)) == -1)
    9           return NULL;
    10
    11      /* Initialize free block header/footer and the epilogue header */
    12      PUT(HDRP(bp), PACK(size, 0));         /* Free block header */
    13      PUT(FTRP(bp), PACK(size, 0));         /* Free block footer */
    14      PUT(HDRP(NEXT_BLKP(bp)), PACK(0, 1)); /* New epilogue header */
    15
    16      /* Coalesce if the previous block was free */
    17      return coalesce(bp);
    18  }

code/vm/malloc/mm.c

Figure 9.45 extend_heap extends the heap with a new free block.
multiple of 2 words (8 bytes) and then requests the additional heap space from
the memory system (lines 7-9).

The remainder of the extend_heap function (lines 12-17) is somewhat subtle.
The heap begins on a double-word aligned boundary, and every call to
extend_heap returns a block whose size is an integral number of double words. Thus, every
call to mem_sbrk returns a double-word aligned chunk of memory immediately
following the header of the epilogue block. This header becomes the header of
the new free block (line 12), and the last word of the chunk becomes the new
epilogue block header (line 14). Finally, in the likely case that the previous heap
was terminated by a free block, we call the coalesce function to merge the two
free blocks and return the block pointer of the merged blocks (line 17).


Freeing and Coalescing Blocks 


An application frees a previously allocated block by calling the mm_free function
(Figure 9.46), which frees the requested block (bp) and then merges adjacent 
free blocks using the boundary-tags coalescing technique described in Section 
9.9.11. 

The code in the coalesce helper function is a straightforward implementation 
of the four cases outlined in Figure 9.40. There is one somewhat subtle aspect. The 
free list format we have chosen, with its prologue and epilogue blocks that are
always marked as allocated, allows us to ignore the potentially troublesome edge
conditions where the requested block bp is at the beginning or end of the heap. 
Without these special blocks, the code would be messier, more error prone, and 
slower because we would have to check for these rare edge conditions on each 
and every free request. 


Allocating Blocks 


An application requests a block of size bytes of memory by calling the mm_malloc 
function (Figure 9.47). After checking for spurious requests, the allocator must 
adjust the requested block size to allow room for the header and the footer, and to 
satisfy the double-word alignment requirement. Lines 12-13 enforce the minimum 
block size of 16 bytes: 8 bytes to satisfy the alignment requirement and 8 more 
bytes for the overhead of the header and footer. For requests over 8 bytes (line 15), 
the general rule is to add in the overhead bytes and then round up to the nearest 
multiple of 8.
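
The size adjustment can be checked in isolation. The sketch below lifts the arithmetic from mm_malloc into a helper so that the rounding is easy to test. The name adjust_size is an assumption for this example, and DSIZE is repeated from Figure 9.43.

```c
#include <assert.h>
#include <stddef.h>

#define DSIZE 8   /* double word size in bytes, as in Figure 9.43 */

/* Return the block size needed for a payload of `size` bytes:
   at least 16 bytes, otherwise payload plus 8 bytes of header/footer
   overhead, rounded up to a multiple of 8. */
static size_t adjust_size(size_t size)
{
    assert(size > 0);   /* mm_malloc has already rejected size 0 */
    if (size <= DSIZE)
        return 2 * DSIZE;
    return DSIZE * ((size + DSIZE + (DSIZE - 1)) / DSIZE);
}
```

For example, a 1-byte and an 8-byte request both become 16-byte blocks, a 9-byte request becomes a 24-byte block, and a 17-byte request becomes a 32-byte block.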

Once the allocator has adjusted the requested size, it searches the free list for a
suitable free block (line 18). If there is a fit, then the allocator places the requested 
block and optionally splits the excess (line 19) and then returns the address of the 
newly allocated block. 

If the allocator cannot find a fit, it extends the heap with a new free block 
(lines 24-26), places the requested block in the new free block, optionally splitting 
the block (line 27), and then returns a pointer to the newly allocated block. 










code/vm/malloc/mm.c

    void mm_free(void *bp)
    {
        size_t size = GET_SIZE(HDRP(bp));

        PUT(HDRP(bp), PACK(size, 0));
        PUT(FTRP(bp), PACK(size, 0));
        coalesce(bp);
    }

    static void *coalesce(void *bp)
    {
        size_t prev_alloc = GET_ALLOC(FTRP(PREV_BLKP(bp)));
        size_t next_alloc = GET_ALLOC(HDRP(NEXT_BLKP(bp)));
        size_t size = GET_SIZE(HDRP(bp));

        if (prev_alloc && next_alloc) {            /* Case 1 */
            return bp;
        }

        else if (prev_alloc && !next_alloc) {      /* Case 2 */
            size += GET_SIZE(HDRP(NEXT_BLKP(bp)));
            PUT(HDRP(bp), PACK(size, 0));
            PUT(FTRP(bp), PACK(size, 0));
        }

        else if (!prev_alloc && next_alloc) {      /* Case 3 */
            size += GET_SIZE(HDRP(PREV_BLKP(bp)));
            PUT(FTRP(bp), PACK(size, 0));
            PUT(HDRP(PREV_BLKP(bp)), PACK(size, 0));
            bp = PREV_BLKP(bp);
        }

        else {                                     /* Case 4 */
            size += GET_SIZE(HDRP(PREV_BLKP(bp))) +
                GET_SIZE(FTRP(NEXT_BLKP(bp)));
            PUT(HDRP(PREV_BLKP(bp)), PACK(size, 0));
            PUT(FTRP(NEXT_BLKP(bp)), PACK(size, 0));
            bp = PREV_BLKP(bp);
        }
        return bp;
    }

code/vm/malloc/mm.c

Figure 9.46 mm_free frees a block and uses boundary-tag coalescing to merge it
with any adjacent free blocks in constant time.
code/vm/malloc/mm.c

    1   void *mm_malloc(size_t size)
    2   {
    3       size_t asize;      /* Adjusted block size */
    4       size_t extendsize; /* Amount to extend heap if no fit */
    5       char *bp;
    6
    7       /* Ignore spurious requests */
    8       if (size == 0)
    9           return NULL;
    10
    11      /* Adjust block size to include overhead and alignment reqs. */
    12      if (size <= DSIZE)
    13          asize = 2*DSIZE;
    14      else
    15          asize = DSIZE * ((size + (DSIZE) + (DSIZE-1)) / DSIZE);
    16
    17      /* Search the free list for a fit */
    18      if ((bp = find_fit(asize)) != NULL) {
    19          place(bp, asize);
    20          return bp;
    21      }
    22
    23      /* No fit found. Get more memory and place the block */
    24      extendsize = MAX(asize,CHUNKSIZE);
    25      if ((bp = extend_heap(extendsize/WSIZE)) == NULL)
    26          return NULL;
    27      place(bp, asize);
    28      return bp;
    29  }

code/vm/malloc/mm.c

Figure 9.47 mm_malloc allocates a block from the free list.


Implement a find_fit function for the simple allocator described in Section
9.9.12.

    static void *find_fit(size_t asize)

Your solution should perform a first-fit search of the implicit free list.

Implement a place function for the same allocator.

    static void place(void *bp, size_t asize)

Your solution should place the requested block at the beginning of the free
block, splitting only if the size of the remainder would equal or exceed the
minimum block size.


9.9.13 Explicit Free Lists 





The implicit free list provides us with a simple way to introduce some basic
allocator concepts. However, because block allocation time is linear in the total
number of heap blocks, the implicit free list is not appropriate for a
general-purpose allocator (although it might be fine for a special-purpose allocator where
the number of heap blocks is known beforehand to be small).

A better approach is to organize the free blocks into some form of explicit
data structure. Since by definition the body of a free block is not needed by the
program, the pointers that implement the data structure can be stored within the
bodies of the free blocks. For example, the heap can be organized as a doubly
linked free list by including a pred (predecessor) and succ (successor) pointer in
each free block, as shown in Figure 9.48.

Using a doubly linked list instead of an implicit free list reduces the first-fit
allocation time from linear in the total number of blocks to linear in the number
of free blocks. However, the time to free a block can be either linear or constant,
depending on the policy we choose for ordering the blocks in the free list.
Figure 9.48 Format of heap blocks that use doubly linked free lists.
One approach is to maintain the list in last-in first-out (LIFO) order by
inserting newly freed blocks at the beginning of the list. With a LIFO ordering
and a first-fit placement policy, the allocator inspects the most recently used
blocks first. In this case, freeing a block can be performed in constant time.
If boundary tags are used, then coalescing can also be performed in constant
time.

Another approach is to maintain the list in address order, where the address
of each block in the list is less than the address of its successor. In this case, freeing
a block requires a linear-time search to locate the appropriate predecessor. The
trade-off is that address-ordered first fit enjoys better memory utilization than
LIFO-ordered first fit, approaching the utilization of best fit.
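
The constant-time behavior of the LIFO policy is easiest to see in code. The following is a minimal sketch of the list operations only. It assumes that the pred and succ pointers occupy the first two words of a free block's body, and the names struct free_block, fl_push, and fl_remove are hypothetical. Headers, footers, and coalescing are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed view of the start of a free block's body. */
struct free_block {
    struct free_block *pred;
    struct free_block *succ;
};

/* Insert blk at the front of the list (LIFO order) in constant time.
   Returns the new head of the list. */
static struct free_block *fl_push(struct free_block *head, struct free_block *blk)
{
    assert(blk != NULL);
    blk->pred = NULL;
    blk->succ = head;
    if (head != NULL)
        head->pred = blk;
    return blk;
}

/* Unlink blk from the list in constant time, as an allocator would do
   when it allocates or coalesces the block. Returns the new head. */
static struct free_block *fl_remove(struct free_block *head, struct free_block *blk)
{
    assert(blk != NULL);
    if (blk->pred != NULL)
        blk->pred->succ = blk->succ;
    else
        head = blk->succ;              /* blk was the head of the list */
    if (blk->succ != NULL)
        blk->succ->pred = blk->pred;
    blk->pred = blk->succ = NULL;
    return head;
}
```

An address-ordered list would replace fl_push with a walk that finds the first block whose address exceeds blk, which is where the linear-time cost of freeing comes from.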

A disadvantage of explicit lists in general is that free blocks must be large
enough to contain all of the necessary pointers, as well as the header and possibly
a footer. This results in a larger minimum block size and increases the potential
for internal fragmentation.


9.9.14 Segregated Free Lists 


As we have seen, an allocator that uses a single linked list of free blocks requires
time linear in the number of free blocks to allocate a block. A popular approach for
reducing the allocation time, known generally as segregated storage, is to maintain
multiple free lists, where each list holds blocks that are roughly the same size. The
general idea is to partition the set of all possible block sizes into equivalence classes
called size classes. There are many ways to define the size classes. For example, we
might partition the block sizes by powers of 2:


    {1}, {2}, {3, 4}, {5–8}, ..., {1,025–2,048}, {2,049–4,096}, {4,097–∞}


Or we might assign small blocks to their own size classes and partition large blocks 
by powers of 2: 


    {1}, {2}, {3}, ..., {1,023}, {1,024}, {1,025–2,048}, {2,049–4,096}, {4,097–∞}


The allocator maintains an array of free lists, with one free list per size class, 
ordered by increasing size. When the allocator needs a block of size n, it searches 
the appropriate free list. If it cannot find a block that fits, it searches the next list, 
and so on. 
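
As one way to make the lookup concrete, the sketch below maps a block size to the index of its size class under the first (power-of-2) partition shown above. The function name size_class and the constant NUM_CLASSES are assumptions for this example, with class 0 holding {1} and class 13 holding everything above 4,096.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CLASSES 14   /* {1}, {2}, {3,4}, ..., {2,049-4,096}, {4,097-inf} */

/* Return the index of the size class that contains n.
   Class k (for k < 13) holds sizes up to 2^k; class 13 holds the rest. */
static int size_class(size_t n)
{
    assert(n > 0);
    int k = 0;
    size_t limit = 1;                    /* largest size in class k */
    while (limit < n && k < NUM_CLASSES - 1) {
        limit <<= 1;
        k++;
    }
    return k;
}
```

An allocator would use the returned index to pick the first free list to search and then move on to higher indices if that list has no block that fits.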

The dynamic storage allocation literature describes dozens of variants of
segregated storage that differ in how they define size classes, when they perform
coalescing, when they request additional heap memory from the operating sys- 
tem, whether they allow splitting, and so forth. To give you a sense of what is 
possible, we will describe two of the basic approaches: simple segregated storage 
and segregated fits. 
Simple Segregated Storage


With simple segregated storage, the free list for each size class contains same-size
blocks, each the size of the largest element of the size class. For example, if some
size class is defined as {17–32}, then the free list for that class consists entirely of
blocks of size 32.

To allocate a block of some given size, we check the appropriate free list. If the
list is not empty, we simply allocate the first block in its entirety. Free blocks are 
never split to satisfy allocation requests. If the list is empty, the allocator requests 
a fixed-size chunk of additional memory from the operating system (typically a 
multiple of the page size), divides the chunk into equal-size blocks, and links the 
blocks together to form the new free list. To free a block, the allocator simply 
inserts the block at the front of the appropriate free list. 

There are a number of advantages to this simple scheme. Allocating and 
freeing blocks are both fast constant-time operations. Further, the combination 
of the same-size blocks in each chunk, no splitting, and no coalescing means that 
there is very little per-block memory overhead. Since each chunk has only same- 
size blocks, the size of an allocated block can be inferred from its address. Since 
there is no coalescing, allocated blocks do not need an allocated/free flag in the 
header. Thus, allocated blocks require no headers, and since there is no coalescing, 
they do not require any footers either. Since allocate and free operations insert 
and delete blocks at the beginning of the free list, the list need only be singly 
linked instead of doubly linked. The bottom line is that the only required field in 
any block is a one-word succ pointer in each free block, and thus the minimum 
block size is only one word. 
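
The following sketch shows how little machinery one size class needs under this scheme. It assumes the chunk has already been obtained from the operating system, and the names struct node, carve_chunk, ss_alloc, and ss_free are hypothetical. Each free block stores only a succ pointer in its first word.

```c
#include <assert.h>
#include <stddef.h>

/* The only field a free block needs: a pointer to its successor. */
struct node { struct node *succ; };

/* Divide a chunk into equal-size blocks and link them into a singly
   linked free list in address order. Returns the head of the list. */
static struct node *carve_chunk(void *chunk, size_t chunk_bytes, size_t block_bytes)
{
    assert(block_bytes >= sizeof(struct node));
    assert(block_bytes % sizeof(struct node) == 0);  /* keep blocks pointer-aligned */
    struct node *head = NULL;
    size_t nblocks = chunk_bytes / block_bytes;
    char *base = chunk;
    for (size_t i = nblocks; i > 0; i--) {           /* link back to front */
        struct node *blk = (struct node *)(base + (i - 1) * block_bytes);
        blk->succ = head;
        head = blk;
    }
    return head;
}

/* Allocate: pop the first block, or return NULL if the list is empty
   (a full allocator would request and carve a new chunk here). */
static void *ss_alloc(struct node **list)
{
    struct node *blk = *list;
    if (blk != NULL)
        *list = blk->succ;
    return blk;
}

/* Free: push the block on the front of its size class's list. */
static void ss_free(struct node **list, void *p)
{
    struct node *blk = p;
    blk->succ = *list;
    *list = blk;
}
```

Both ss_alloc and ss_free touch only the head of the list, which is why allocation and freeing are constant-time operations and the list can be singly linked.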

A significant disadvantage is that simple segregated storage is susceptible to 
internal and external fragmentation. Internal fragmentation is possible because 
free blocks are never split. Worse, certain reference patterns can cause extreme 
external fragmentation because free blocks are never coalesced (Practice Prob- 
lem 9.10). 


Practice Problem 9.10

Describe a reference pattern that results in severe external fragmentation in an
allocator based on simple segregated storage.





Segregated Fits 


With this approach, the allocator maintains an array of free lists. Each free list is 
associated with a size class and is organized as some kind of explicit or implicit list. 
Each list contains potentially different-size blocks whose sizes are members of the 
size class. There are many variants of segregated fits allocators. Here we describe 
a simple version. 

To allocate a block, we determine the size class of the request and do a first- 
fit search of the appropriate free list for a block that fits. If we find one, then we 
(optionally) split it and insert the fragment in the appropriate free list. If we cannot 
find a block that fits, then we search the free list for the next larger size class. We 
repeat until we find a block that fits. If none of the free lists yields a block that fits,
then we request additional heap memory from the operating system, allocate the 
block out of this new heap memory, and place the remainder in the appropriate 
size class. To free a block, we coalesce and place the result on the appropriate 
free list. 

The segregated fits approach is a popular choice with production-quality 
allocators such as the GNU malloc package provided in the C standard library 
because it is both fast and memory efficient. Search times are reduced because 
searches are limited to particular parts of the heap instead of the entire heap. 
Memory utilization can improve because of the interesting fact that a simple first- 
fit search of a segregated free list approximates a best-fit search of the entire heap. 


Buddy Systems


A buddy system is a special case of segregated fits where each size class is a power
of 2. The basic idea is that, given a heap of 2^m words, we maintain a separate free
list for each block size 2^k, where 0 ≤ k ≤ m. Requested block sizes are rounded up
to the nearest power of 2. Originally, there is one free block of size 2^m words.

To allocate a block of size 2^k, we find the first available block of size 2^j, such
that k ≤ j ≤ m. If j = k, then we are done. Otherwise, we recursively split the block
in half until j = k. As we perform this splitting, each remaining half (known as a
buddy) is placed on the appropriate free list. To free a block of size 2^k, we continue
coalescing with the free buddies. When we encounter an allocated buddy, we stop
the coalescing.

A key fact about buddy systems is that, given the address and size of a block, 
it is easy to compute the address of its buddy. For example, a block of size 32 bytes 
with address 


    xxx...x00000

has its buddy at address

    xxx...x10000


In other words, the addresses of a block and its buddy differ in exactly one bit
position.

The major advantage of a buddy system allocator is its fast searching and
coalescing. The major disadvantage is that the power-of-2 requirement on the
block size can cause significant internal fragmentation. For this reason, buddy
system allocators are not appropriate for general-purpose workloads. However,
for certain application-specific workloads, where the block sizes are known in
advance to be powers of 2, buddy system allocators have a certain appeal.
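
Because a block and its buddy differ in a single bit, the buddy can be found with one exclusive-or. The sketch below works under two assumptions that the text does not spell out: block positions are measured as offsets from the start of the heap, and a block of size 2^k always starts at a multiple of 2^k. The names buddy_of and round_up_pow2 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if n is a power of 2. */
static int is_pow2(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Offset of the buddy of the block at `offset` with the given size.
   Flipping the bit whose value equals the block size selects the
   other half of the parent block. */
static size_t buddy_of(size_t offset, size_t size)
{
    assert(is_pow2(size));
    assert(offset % size == 0);
    return offset ^ size;
}

/* Smallest power of 2 that is >= n, used to round up a request. */
static size_t round_up_pow2(size_t n)
{
    assert(n > 0 && n <= SIZE_MAX / 2 + 1);   /* avoid overflow in the shift */
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}
```

Applying buddy_of twice with the same size returns the original offset, which is what makes the coalescing check on free so cheap.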


9.10 Garbage Collection 


With an explicit allocator such as the C malloc package, an application allocates
and frees heap blocks by making calls to malloc and free. It is the application's
responsibility to free any allocated blocks that it no longer needs.
Failing to free allocated blocks is a common programming error. For example,
consider the following C function that allocates a block of temporary storage as 
part of its processing: 


    void garbage()
    {
        int *p = (int *)Malloc(15213);

        return; /* Array p is garbage at this point */
    }


Since p is no longer needed by the program, it should have been freed before 
garbage returned. Unfortunately, the programmer has forgotten to free the block. 
It remains allocated for the lifetime of the program, needlessly occupying heap 
space that could be used to satisfy subsequent allocation requests. 
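
For comparison, here is a sketch of the same function with the missing call to free added. It uses the standard malloc with an explicit NULL check instead of the Malloc wrapper, and the function name no_garbage and its return convention are assumptions for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates temporary storage, uses it, and releases it before
   returning. Returns 0 on success and -1 if the allocation failed. */
static int no_garbage(void)
{
    int *p = malloc(15213);
    if (p == NULL)
        return -1;

    p[0] = 42;          /* stand-in for the function's real work */
    assert(p[0] == 42);

    free(p);            /* the block goes back to the free list here */
    return 0;
}
```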

A garbage collector is a dynamic storage allocator that automatically frees al- 
located blocks that are no longer needed by the program. Such blocks are known 
as garbage (hence the term “garbage collector”). The process of automatically 
reclaiming heap storage is known as garbage collection. In a system that supports 
garbage collection, applications explicitly allocate heap blocks but never explic- 
itly free them. In the context of a C program, the application calls malloc but 
never calls free. Instead, the garbage collector periodically identifies the garbage
blocks and makes the appropriate calls to free to place those blocks back on the 
free list. 

Garbage collection dates back to Lisp systems developed by John McCarthy 
at MIT in the early 1960s. It is an important part of modern language systems such 
as Java, ML, Perl, and Mathematica, and it remains an active and important area of 
research. The literature describes an amazing number of approaches for garbage 
collection. We will limit our discussion to McCarthy's original Mark&Sweep
algorithm, which is interesting because it can be built on top of an existing malloc
package to provide garbage collection for C and C++ programs. 


9.10.1 Garbage Collector Basics 


A garbage collector views memory as a directed reachability graph of the form 
shown in Figure 9.49. The nodes of the graph are partitioned into a set of root 
nodes and a set of heap nodes. Each heap node corresponds to an allocated block 
in the heap. A directed edge p → q means that some location in block p points to
some location in block q. Root nodes correspond to locations not in the heap that 
contain pointers into the heap. These locations can be registers, variables on the 
stack, or global variables in the read/write data area of virtual memory. 

We say that a node p is reachable if there exists a directed path from any root 
node to p. At any point in time, the unreachable nodes correspond to garbage that 
can never be used again by the application. The role of a garbage collector is to 
maintain some representation of the reachability graph and periodically reclaim 
the unreachable nodes by freeing them and returning them to the free list. 
Figure 9.49 Reachability graph with root nodes and heap nodes.

Garbage collectors for languages like ML and Java, which exert tight control
over how applications create and use pointers, can maintain an exact
representation of the reachability graph and thus can reclaim all garbage. However, collectors
for languages like C and C++ cannot in general maintain exact representations
of the reachability graph. Such collectors are known as conservative garbage
collectors. They are conservative in the sense that each reachable block is correctly
identified as reachable, while some unreachable nodes might be incorrectly
identified as reachable.

Collectors can provide their service on demand, or they can run as separate
threads in parallel with the application, continuously updating the reachability
graph and reclaiming garbage. For example, consider how we might incorporate a
conservative collector for C programs into an existing malloc package, as shown
in Figure 9.50.

The application calls malloc in the usual manner whenever it needs heap
space. If malloc is unable to find a free block that fits, then it calls the garbage
collector in hopes of reclaiming some garbage to the free list. The collector identifies
the garbage blocks and returns them to the heap by calling the free function. The
key idea is that the collector calls free instead of the application. When the call
to the collector returns, malloc tries again to find a free block that fits. If that fails,
then it can ask the operating system for additional memory. Eventually, malloc
returns a pointer to the requested block (if successful) or the NULL pointer (if
unsuccessful).


9.10.2 Mark&Sweep Garbage Collectors 


A Mark&Sweep garbage collector consists of a mark phase, which marks all
reachable and allocated descendants of the root nodes, followed by a sweep phase, 
which frees each unmarked allocated block. Typically, one of the spare low-order 
bits in the block header is used to indicate whether a block is marked or not. 


(a) mark function

    void mark(ptr p) {
        if ((b = isPtr(p)) == NULL)
            return;
        if (blockMarked(b))
            return;
        markBlock(b);
        len = length(b);
        for (i=0; i < len; i++)
            mark(b[i]);
        return;
    }

(b) sweep function

    void sweep(ptr b, ptr end) {
        while (b < end) {
            if (blockMarked(b))
                unmarkBlock(b);
            else if (blockAllocated(b))
                free(b);
            b = nextBlock(b);
        }
        return;
    }

Figure 9.51 Pseudocode for the mark and sweep functions.



Our description of Mark&Sweep will assume the following functions, where 
ptr is defined as typedef void *ptr: 


ptr isPtr(ptr p). If p points to some word in an allocated block, it returns a
    pointer b to the beginning of that block. Returns NULL otherwise.

int blockMarked(ptr b). Returns true if block b is already marked.

int blockAllocated(ptr b). Returns true if block b is allocated.

void markBlock(ptr b). Marks block b.

int length(ptr b). Returns the length in words (excluding the header) of
    block b.

void unmarkBlock(ptr b). Changes the status of block b from marked to
    unmarked.

ptr nextBlock(ptr b). Returns the successor of block b in the heap.


The mark phase calls the mark function shown in Figure 9.51(a) once for
each root node. The mark function returns immediately if p does not point to
an allocated and unmarked heap block. Otherwise, it marks the block and calls
itself recursively on each word in the block. Each call to the mark function marks any
unmarked and reachable descendants of some root node. At the end of the mark
phase, any allocated block that is not marked is guaranteed to be unreachable and,
hence, garbage that can be reclaimed in the sweep phase.

The sweep phase is a single call to the sweep function shown in Figure 9.51(b). 
The sweep function iterates over each block in the heap, freeing any unmarked 
allocated blocks (i.e., garbage) that it encounters.

Figure 9.52 shows a graphical interpretation of Mark&Sweep for a small heap.
Block boundaries are indicated by heavy lines. Each square corresponds to a
word of memory. Each block has a one-word header, which is either marked or
unmarked.

Initially, the heap in Figure 9.52 consists of six allocated blocks, each of which
is unmarked. Block 3 contains a pointer to block 1. Block 4 contains pointers to
blocks 3 and 6. The root points to block 4. After the mark phase, blocks 1, 3, 4, and 6
are marked because they are reachable from the root. Blocks 2 and 5 are unmarked
because they are unreachable. After the sweep phase, the two unreachable blocks
are reclaimed to the free list.
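
The two phases can be tried out on a toy model of the six-block heap just described. This is only a sketch under stated assumptions, not the allocator from this chapter: the Block struct, the fixed two-slot child array, and the names mark_from, sweep_all, and collect_example are invented for the illustration, and "freeing" a block simply clears its allocated flag.

```c
#include <stddef.h>

#define MAX_CHILDREN 2

/* Hypothetical toy block: a real collector keeps the mark bit in the header word. */
typedef struct Block {
    int allocated;                      /* 1 if the block is in use */
    int marked;                         /* set during the mark phase */
    struct Block *child[MAX_CHILDREN];  /* pointers stored in the payload */
} Block;

/* Mark phase: mark every allocated block reachable from b. */
void mark_from(Block *b)
{
    if (b == NULL || !b->allocated || b->marked)
        return;
    b->marked = 1;
    for (int i = 0; i < MAX_CHILDREN; i++)
        mark_from(b->child[i]);
}

/* Sweep phase: unmark survivors and free unmarked allocated blocks.
   Returns the number of blocks reclaimed. */
int sweep_all(Block *heap, int nblocks)
{
    int freed = 0;
    for (int i = 0; i < nblocks; i++) {
        if (heap[i].marked) {
            heap[i].marked = 0;
        } else if (heap[i].allocated) {
            heap[i].allocated = 0;      /* stands in for free(b) */
            freed++;
        }
    }
    return freed;
}

/* Builds the six-block example and collects it.
   allocated_after[i] reports whether block i+1 survived. */
int collect_example(int allocated_after[6])
{
    Block heap[6] = {0};
    for (int i = 0; i < 6; i++)
        heap[i].allocated = 1;
    heap[2].child[0] = &heap[0];        /* block 3 -> block 1 */
    heap[3].child[0] = &heap[2];        /* block 4 -> block 3 */
    heap[3].child[1] = &heap[5];        /* block 4 -> block 6 */

    mark_from(&heap[3]);                /* the root points to block 4 */
    int freed = sweep_all(heap, 6);
    for (int i = 0; i < 6; i++)
        allocated_after[i] = heap[i].allocated;
    return freed;
}
```

Calling collect_example reclaims two blocks (2 and 5) and leaves blocks 1, 3, 4, and 6 allocated, matching the walkthrough of Figure 9.52.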


9.10.3 Conservative Mark&Sweep for C Programs 


Mark&Sweep is an appropriate approach for garbage collecting C programs
because it works in place without moving any blocks. However, the C language poses
some interesting challenges for the implementation of the isPtr function.

First, C does not tag memory locations with any type information. Thus, there
is no obvious way for isPtr to determine if its input parameter p is a pointer or not.
Second, even if we were to know that p was a pointer, there would be no obvious
way for isPtr to determine whether p points to some location in the payload of
an allocated block.

One solution to the latter problem is to maintain the set of allocated blocks
as a balanced binary tree that maintains the invariant that all blocks in the left
subtree are located at smaller addresses and all blocks in the right subtree are
located at larger addresses. As shown in Figure 9.53, this requires two additional
fields (left and right) in the header of each allocated block. Each field points to
the header of some allocated block. The isPtr(ptr p) function uses the tree to
perform a binary search of the allocated blocks. At each step, it relies on the size
field in the block header to determine if p falls within the extent of the block.
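
The tree search described here might be sketched as follows. The TreeBlock layout and the name find_block are assumptions made for the example (the text only specifies left, right, and size fields), and the sketch omits the rebalancing that a real balanced tree would perform on insertion and deletion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header for an allocated block kept in an address-ordered tree. */
typedef struct TreeBlock {
    size_t size;               /* total block size in bytes, header included */
    struct TreeBlock *left;    /* blocks at smaller addresses */
    struct TreeBlock *right;   /* blocks at larger addresses */
} TreeBlock;

/* Returns the block whose extent contains p, or NULL if there is none. */
TreeBlock *find_block(TreeBlock *root, const void *p)
{
    uintptr_t addr = (uintptr_t)p;

    while (root != NULL) {
        uintptr_t start = (uintptr_t)root;
        if (addr < start)
            root = root->left;               /* p lies below this block */
        else if (addr >= start + root->size)
            root = root->right;              /* p lies above this block */
        else
            return root;                     /* p falls within the extent */
    }
    return NULL;
}
```

Each step discards one subtree, so on a balanced tree the search takes time logarithmic in the number of allocated blocks.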


The balanced tree approach is correct in the sense that it is guaranteed to mark
all of the nodes that are reachable from the roots. This is a necessary guarantee,
as application users would certainly not appreciate having their allocated blocks
prematurely returned to the free list. However, it is conservative in the sense that
it may incorrectly mark blocks that are actually unreachable, and thus it may fail
to free some garbage. While this does not affect the correctness of application
programs, it can result in unnecessary external fragmentation.

The fundamental reason that Mark&Sweep collectors for C programs must
be conservative is that the C language does not tag memory locations with type
information. Thus, scalars like ints or floats can masquerade as pointers. For 
example, suppose that some reachable allocated block contains an int in its 
payload whose value happens to correspond to an address in the payload of some 
other allocated block b. There is no way for the collector to infer that the data is 
really an int and not a pointer. Therefore, the allocator must conservatively mark 
block b as reachable, when in fact it might not be. 














9.11 Common Memory-Related Bugs in C Programs 


Managing and using virtual memory can be a difficult and error-prone task for 
C programmers. Memory-related bugs are among the most frightening because 
they often manifest themselves at a distance, in both time and space, from the 
source of the bug. Write the wrong data to the wrong location, and your program 
can run for hours before it finally fails in some distant part of the program. We 
conclude our discussion of virtual memory with a look at some of the common
memory-related bugs. 












9.11.1 Dereferencing Bad Pointers 





As we learned in Section 9.7.2, there are large holes in the virtual address space of a
process that are not mapped to any meaningful data. If we attempt to dereference 
a pointer into one of these holes, the operating system will terminate our program 
with a segmentation exception. Also, some areas of virtual memory are read-only. 
Attempting to write to one of these areas terminates the program with a protection 
exception. 

A common example of dereferencing a bad pointer is the classic scanf bug. 
Suppose we want to use scanf to read an integer from stdin into a variable. 
The correct way to do this is to pass scanf a format string and the address of the
variable: 



















scanf("%d", &val) 







However, it is easy for new C programmers (and experienced ones too!) to pass 
the contents of val instead of its address: 






scanf("%d", val)

In this case, scanf will interpret the contents of val as an address and attempt to
write a word to that location. In the best case, the program terminates immediately
with an exception. In the worst case, the contents of val correspond to some
valid read/write area of virtual memory, and we overwrite memory, usually with
disastrous and baffling consequences much later.


9.11.2 Reading Uninitialized Memory


While bss memory locations (such as uninitialized global C variables) are always
initialized to zeros by the loader, this is not true for heap memory. A common
error is to assume that heap memory is initialized to zero:


    /* Return y = Ax */
    int *matvec(int **A, int *x, int n)
    {
        int i, j;

        int *y = (int *)Malloc(n * sizeof(int));

        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                y[i] += A[i][j] * x[j];
        return y;
    }


In this example, the programmer has incorrectly assumed that vector y has been
initialized to zero. A correct implementation would explicitly zero y[i] or use
calloc.
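
As one possible repair, the version below uses the standard calloc function so that y starts out zeroed. It is a sketch rather than the book's code: the name matvec_zeroed is made up, and it calls calloc directly and returns NULL on failure instead of using the Malloc wrapper.

```c
#include <stdlib.h>

/* Return y = Ax, or NULL if the allocation fails. */
int *matvec_zeroed(int **A, int *x, int n)
{
    /* calloc zero-fills the block, so the += below starts from 0 */
    int *y = calloc((size_t)n, sizeof(int));
    if (y == NULL)
        return NULL;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += A[i][j] * x[j];
    return y;
}
```

The caller owns the returned vector and is responsible for freeing it.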

9.11.3 Allowing Stack Buffer Overflows


As we saw in Section 3.10.3, a program has a buffer overflow bug if it writes
to a target buffer on the stack without examining the size of the input string.
For example, the following function has a buffer overflow bug because the gets
function copies an arbitrary-length string to the buffer. To fix this, we would need
to use the fgets function, which limits the size of the input string.


    void bufoverflow()
    {
        char buf[64];

        gets(buf); /* Here is the stack buffer overflow bug */
        return;
    }
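
A bounded read along the lines suggested above could look like the following sketch. The helper name read_line_bounded and its return convention are assumptions for the example; only fgets and strcspn come from the standard C library.

```c
#include <stdio.h>
#include <string.h>

/* Reads at most size-1 characters of one line from in into buf.
   Returns 1 on success, 0 on end-of-file or error. */
int read_line_bounded(char *buf, size_t size, FILE *in)
{
    if (fgets(buf, (int)size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\n")] = '\0';   /* drop the trailing newline, if any */
    return 1;
}
```

Because fgets never stores more than size bytes (including the terminating null), an overlong input line is truncated instead of overrunning the buffer; the rest of the line remains in the stream for the next read.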
9.11.4 Assuming That Pointers and the Objects They Point to Are the Same Size


One common mistake is to assume that pointers to objects are the same size as
the objects they point to:


    1   /* Create an nxm array */
    2   int **makeArray1(int n, int m)
    3   {
    4       int i;
    5       int **A = (int **)Malloc(n * sizeof(int));
    6
    7       for (i = 0; i < n; i++)
    8           A[i] = (int *)Malloc(m * sizeof(int));
    9       return A;
    10  }


The intent here is to create an array of n pointers, each of which points to an array
of m ints. However, because the programmer has written sizeof(int) instead
of sizeof(int *) in line 5, the code actually creates an array of ints.

This code will run fine on machines where ints and pointers to ints are the 
same size. But if we run this code on a machine like the Core i7, where a pointer is 
larger than an int, then the loop in lines 7-8 will write past the end of the A array. 
Since one of these words will likely be the boundary-tag footer of the allocated 
block, we may not discover the error until we free the block much later in the 
program, at which point the coalescing code in the allocator will fail dramatically 
and for no apparent reason. This is an insidious example of the kind of "action at
a distance" that is so typical of memory-related programming bugs.
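
One common way to avoid this mistake is to take the size from the pointer being assigned rather than naming a type. The sketch below illustrates that idiom; the name make_array and the error handling are assumptions for the example, and it uses plain malloc rather than the Malloc wrapper.

```c
#include <stdlib.h>

/* Create an n x m array of ints; returns NULL on allocation failure. */
int **make_array(int n, int m)
{
    /* sizeof(*A) is sizeof(int *), whatever the pointer size is */
    int **A = malloc((size_t)n * sizeof(*A));
    if (A == NULL)
        return NULL;

    for (int i = 0; i < n; i++) {
        A[i] = malloc((size_t)m * sizeof(*A[i]));
        if (A[i] == NULL) {          /* undo the partial allocation */
            while (i-- > 0)
                free(A[i]);
            free(A);
            return NULL;
        }
    }
    return A;
}
```

With this form, the element size tracks the declared type of A automatically, so the allocation stays correct on machines where ints and pointers differ in size.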




9.11.5 Making Off-by-One Errors 


Off-by-one errors are another common source of overwriting bugs: 


    1   /* Create an nxm array */
    2   int **makeArray2(int n, int m)
    3   {
    4       int i;
    5       int **A = (int **)Malloc(n * sizeof(int *));
    6
    7       for (i = 0; i <= n; i++)
    8           A[i] = (int *)Malloc(m * sizeof(int));
    9       return A;
    10  }


This is another version of the program in the previous section. Here we have 
created an n-element array of pointers in line 5 but then tried to initialize n + 1 of 
its elements in lines 7 and 8, in the process overwriting some memory that follows 
the A array.

9.11.6 Referencing a Pointer Instead of the Object It Points To


If we are not careful about the precedence and associativity of C operators, then
we incorrectly manipulate a pointer instead of the object it points to. For example,
consider the following function, whose purpose is to remove the first item in a
binary heap of *size items and then reheapify the remaining *size - 1 items:


    1   int *binheapDelete(int **binheap, int *size)
    2   {
    3       int *packet = binheap[0];
    4
    5       binheap[0] = binheap[*size - 1];
    6       *size--; /* This should be (*size)-- */
    7       heapify(binheap, *size, 0);
    8       return(packet);
    9   }


In line 6, the intent is to decrement the integer value pointed to by the size
pointer. However, because the unary -- and * operators have the same precedence
and associate from right to left, the code in line 6 actually decrements the pointer
itself instead of the integer value that it points to. If we are lucky, the program will
crash immediately. But more likely we will be left scratching our heads when the
program produces an incorrect answer much later in its execution. The moral here
is to use parentheses whenever in doubt about precedence and associativity. For
example, in line 6, we should have clearly stated our intent by using the expression
(*size)--.
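
The difference between the two expressions can be made visible with a small sketch. Both function names are invented for the illustration; the second function returns the moved pointer only so that the effect can be observed, and it must be given a pointer that is not the first element of its array.

```c
/* Correct: decrement the int that size points to and return the new value. */
int pop_count(int *size)
{
    (*size)--;
    return *size;
}

/* What *size-- really does: it parses as *(size--), which reads *size and
   then steps the local pointer back by one int. The int is left untouched. */
int *decrement_pointer_by_mistake(int *size)
{
    (void)*size--;
    return size;
}
```

Given int a[2] = {10, 20}, calling decrement_pointer_by_mistake(&a[1]) returns &a[0] and leaves a[1] equal to 20, whereas pop_count changes the pointed-to value.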


9.11.7 Misunderstanding Pointer Arithmetic 

Another common mistake is to forget that arithmetic operations on pointers are
performed in units that are the size of the objects they point to, which are not
necessarily bytes. For example, the intent of the following function is to scan an
array of ints and return a pointer to the first occurrence of val:


    1   int *search(int *p, int val)
    2   {
    3       while (*p && *p != val)
    4           p += sizeof(int); /* Should be p++ */
    5       return p;
    6   }


However, because line 4 increments the pointer by 4 (the number of bytes in an
integer) each time through the loop, the function incorrectly scans every fourth
integer in the array.
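
For comparison, here is a sketch of the function with the increment corrected. The name search_fixed is made up, and the sketch assumes, as the loop condition suggests, that the array is terminated by a 0 element.

```c
/* Scans a zero-terminated array of ints and returns a pointer to the first
   element equal to val, or to the terminating 0 if val is absent. */
int *search_fixed(int *p, int val)
{
    while (*p && *p != val)
        p++;              /* advances by one int, that is, sizeof(int) bytes */
    return p;
}
```

The compiler scales p++ by sizeof(int) on its own, so the loop now visits every element.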
9.11.8 Referencing Nonexistent Variables


Naive C programmers who do not understand the stack discipline will sometimes
reference local variables that are no longer valid, as in the following example:


    int *stackref()
    {
        int val;

        return &val;
    }


This function returns a pointer (say, p) to a local variable on the stack and
then pops its stack frame. Although p still points to a valid memory address, it
no longer points to a valid variable. When other functions are called later in the
program, the memory will be reused for their stack frames. Later, if the program
assigns some value to *p, then it might actually be modifying an entry in another
function's stack frame, with potentially disastrous and baffling consequences.


9.11.9 Referencing Data in Free Heap Blocks


A similar error is to reference data in heap blocks that have already been freed.
Consider the following example, which allocates an integer array x in line 6,
prematurely frees block x in line 10, and then later references it in line 14:


    1   int *heapref(int n, int m)
    2   {
    3       int i;
    4       int *x, *y;
    5
    6       x = (int *)Malloc(n * sizeof(int));
    7
    8       // Other calls to malloc and free go here
    9
    10      free(x);
    11
    12      y = (int *)Malloc(m * sizeof(int));
    13      for (i = 0; i < m; i++)
    14          y[i] = x[i]++; /* Oops! x[i] is a word in a free block */
    15
    16      return y;
    17  }


Depending on the pattern of malloc and free calls that occur between lines 6
and 10, when the program references x[i] in line 14, the array x might be part of
some other allocated heap block and may have been overwritten. As with many
memory-related bugs, the error will only become evident later in the program
when we notice that the values in y are corrupted.


9.11.10 Introducing Memory Leaks 


Memory leaks are slow, silent killers that occur when programmers inadvertently
create garbage in the heap by forgetting to free allocated blocks. For example, the
following function allocates a heap block x and then returns without freeing it:


    void leak(int n)
    {
        int *x = (int *)Malloc(n * sizeof(int));

        return; /* x is garbage at this point */
    }

If leak is called frequently, then the heap will gradually fill up with garbage, 
in the worst case consuming the entire virtual address space. Memory leaks are 
particularly serious for programs such as daemons and servers, which by definition
never terminate. 


9.12 Summary 


Virtual memory is an abstraction of main memory. Processors that support
virtual memory reference main memory using a form of indirection known as virtual
addressing. The processor generates a virtual address, which is translated into a
physical address before being sent to the main memory. The translation of
addresses from a virtual address space to a physical address space requires close
cooperation between hardware and software. Dedicated hardware translates
virtual addresses using page tables whose contents are supplied by the operating
system.

Virtual memory provides three important capabilities. First, it automatically
caches recently used contents of the virtual address space stored on disk in main
memory. The block in a virtual memory cache is known as a page. A reference
to a page on disk triggers a page fault that transfers control to a fault handler
in the operating system. The fault handler copies the page from disk to the main
memory cache, writing back the evicted page if necessary. Second, virtual memory
simplifies memory management, which in turn simplifies linking, sharing data
between processes, the allocation of memory for processes, and program loading.
Finally, virtual memory simplifies memory protection by incorporating protection
bits into every page table entry.

The process of address translation must be integrated with the operation of
any hardware caches in the system. Most page table entries are located in the L1
cache, but the cost of accessing page table entries from L1 is usually eliminated
by an on-chip cache of page table entries called a TLB.

Modern systems initialize chunks of virtual memory by associating them with
chunks of files on disk, a process known as memory mapping. Memory mapping
provides an efficient mechanism for sharing data, creating new processes, and
loading programs. Applications can manually create and delete areas of the virtual
address space using the mmap function. However, most programs rely on a dynamic
memory allocator such as malloc, which manages memory in an area of the virtual
address space called the heap. Dynamic memory allocators are application-level
programs with a system-level feel, directly manipulating memory without much
help from the type system. Allocators come in two flavors. Explicit allocators
require applications to explicitly free their memory blocks. Implicit allocators
(garbage collectors) free any unused and unreachable blocks automatically.

Managing and using memory is a difficult and error-prone task for C
programmers. Examples of common errors include dereferencing bad pointers, reading
uninitialized memory, allowing stack buffer overflows, assuming that pointers and
the objects they point to are the same size, referencing a pointer instead of the
object it points to, misunderstanding pointer arithmetic, referencing nonexistent
variables, and introducing memory leaks.


Homework Problems


9.11 ◆

In the following series of problems, you are to show how the example memory
system in Section 9.6.4 translates a virtual address into a physical address and
accesses the cache. For the given virtual address, indicate the TLB entry accessed,
the physical address, and the cache byte value returned. Indicate whether the TLB
misses, whether a page fault occurs, and whether a cache miss occurs. If there is
a cache miss, enter "-" for "Cache byte returned." If there is a page fault, enter
"-" for "PPN" and leave parts C and D blank.


Virtual address: 0x027c 


A. Virtual address format

    13  12  11  10  9  8  7  6  5  4  3  2  1  0


B. Address translation

    Parameter              Value
    VPN                    ________
    TLB index              ________
    TLB tag                ________
    TLB hit? (Y/N)         ________
    Page fault? (Y/N)      ________
    PPN                    ________


C. Physical address format


D. Physical memory reference

    Parameter              Value
    Byte offset            ________
    Cache index            ________
    Cache tag              ________
    Cache hit? (Y/N)       ________
    Cache byte returned    ________

9.12 ◆
Repeat Problem 9.11 for the following address.


Virtual address: 0x03a9


A. Virtual address format

    13  12  11  10  9  8  7  6  5  4  3  2  1  0


B. Address translation

    Parameter              Value
    VPN                    ________
    TLB index              ________
    TLB tag                ________
    TLB hit? (Y/N)         ________
    Page fault? (Y/N)      ________
    PPN                    ________


C. Physical address format


D. Physical memory reference

    Parameter              Value
    Byte offset            ________
    Cache index            ________
    Cache tag              ________
    Cache hit? (Y/N)       ________
    Cache byte returned    ________


9.13 ◆
Repeat Problem 9.11 for the following address.


Virtual address: 0x0040

    13  12  11  10  9  8  7  6  5  4  3  2  1  0


A. Address translation

    Parameter              Value
    VPN                    ________
    TLB index              ________
    TLB tag                ________
    TLB hit? (Y/N)         ________
    Page fault? (Y/N)      ________
    PPN                    ________


B. Physical address format

    11  10  9  8  7  6  5  4  3  2  1  0


C. Physical memory reference

    Parameter              Value
    Byte offset            ________
    Cache index            ________
    Cache tag              ________
    Cache hit? (Y/N)       ________
    Cache byte returned    ________


9.14 ◆◆
Given an input file hello.txt that consists of the string Hello, world!\n, write
a C program that uses mmap to change the contents of hello.txt to Jello,
world!\n.


9.15 ◆
Determine the block sizes and header values that would result from the
following sequence of malloc requests. Assumptions: (1) The allocator maintains
double-word alignment and uses an implicit free list with the block format from
Figure 9.35. (2) Block sizes are rounded up to the nearest multiple of 8 bytes.

    Request       Block size (decimal bytes)    Block header (hex)
    malloc(3)     ________                      ________
    malloc(11)    ________                      ________
    malloc(20)    ________                      ________
    malloc(21)    ________                      ________

9.16 ◆

Determine the minimum block size for each of the following combinations of
alignment requirements and block formats. Assumptions: Explicit free list, 4-byte
pred and succ pointers in each free block, zero-size payloads are not allowed, and
headers and footers are stored in 4-byte words.

    Alignment      Allocated block          Free block           Minimum block size (bytes)
    Single word    Header and footer        Header and footer    ________
    Single word    Header, but no footer    Header and footer    ________
    Double word    Header and footer        Header and footer    ________
    Double word    Header, but no footer    Header and footer    ________

9.17 ◆◆◆
Develop a version of the allocator in Section 9.9.12 that performs a next-fit search
instead of a first-fit search.

9.18 ◆◆◆
The allocator in Section 9.9.12 requires both a header and a footer for each block
in order to perform constant-time coalescing. Modify the allocator so that free
blocks require a header and a footer, but allocated blocks require only a header.

9.19 ◆

You are given three groups of statements relating to memory management and
garbage collection below. In each group, only one statement is true. Your task is
to indicate which statement is true.


1. (a) In a buddy system, up to 50% of the space can be wasted due to internal
       fragmentation.

   (b) The first-fit memory allocation algorithm is slower than the best-fit
       algorithm (on average).

   (c) Deallocation using boundary tags is fast only when the list of free blocks
       is ordered according to increasing memory addresses.

   (d) The buddy system suffers from internal fragmentation, but not from
       external fragmentation.


2. (a) Using the first-fit algorithm on a free list that is ordered according to
       decreasing block sizes results in low performance for allocations, but
       avoids external fragmentation.

   (b) For the best-fit method, the list of free blocks should be ordered according
       to increasing memory addresses.

   (c) The best-fit method chooses the largest free block into which the
       requested segment fits.

   (d) Using the first-fit algorithm on a free list that is ordered according to
       increasing block sizes is equivalent to using the best-fit algorithm.


3. Mark&Sweep garbage collectors are called conservative if
   (a) They coalesce freed memory only when a memory request cannot be
       satisfied.
   (b) They treat everything that looks like a pointer as a pointer.
   (c) They perform garbage collection only when they run out of memory.
   (d) They do not free memory blocks forming a cyclic list.


9.20 ◆◆◆◆
Write your own version of malloc and free, and compare its running time and
space utilization to the version of malloc provided in the standard C library.


Solutions to Practice Problems 


Solution to Problem 9.1 (page 805)

This problem gives you some appreciation for the sizes of different address spaces.
At one point in time, a 32-bit address space seemed impossibly large. But now
there are database and scientific applications that need more, and you can expect
this trend to continue. At some point in your lifetime, expect to find yourself
complaining about the cramped 64-bit address space on your personal computer!


    Number of           Number of
    address bits (n)    virtual addresses (N)    Largest possible virtual address
    8                   2^8 = 256                2^8 - 1 = 255
    16                  2^16 = 64K               2^16 - 1 = 64K - 1
    32                  2^32 = 4G                2^32 - 1 = 4G - 1
    48                  2^48 = 256T              2^48 - 1 = 256T - 1
    64                  2^64 = 16,384P           2^64 - 1 = 16,384P - 1


Solution to Problem 9.2 (page 807)
Since each virtual page is P = 2^p bytes, there are a total of 2^n / 2^p = 2^(n-p) possible
pages in the system, each of which needs a page table entry (PTE).
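
The formula 2^(n-p) is easy to check mechanically. The helper below is a sketch written for this purpose (the name num_ptes is not from the book); it takes the number of virtual address bits n and the base-2 logarithm p of the page size.

```c
#include <stdint.h>

/* Number of page table entries for an n-bit virtual address space with
   2^p-byte pages: 2^n / 2^p = 2^(n-p). Assumes p <= n and n - p < 64. */
uint64_t num_ptes(unsigned n, unsigned p)
{
    return (uint64_t)1 << (n - p);
}
```

For instance, 4 KB pages mean p = 12, so num_ptes(16, 12) is 16 and num_ptes(32, 12) is 2^20 = 1M.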


    n     P = 2^p    Number of PTEs
    16    4K         16
    16    8K         8
    32    4K         1M
    32    8K         512K

Solution to Problem 9.3 (page 816)


You need to understand this kind of problem well in order to fully grasp address
translation. Here is how to solve the first subproblem: We are given n = 32 virtual
address bits and m = 24 physical address bits. A page size of P = 1 KB means we
need log2(1K) = 10 bits for both the VPO and PPO. (Recall that the VPO and PPO
are identical.) The remaining address bits are the VPN and PPN, respectively.








                      Number of
    P       VPN bits    VPO bits    PPN bits    PPO bits
    1 KB    22          10          14          10
    2 KB    21          11          13          11
    4 KB    20          12          12          12
    8 KB    19          13          11          13


Solution to Problem 9.4 (page 824) 

Doing a few of these manual simulations is a great way to firm up your understand- 
ing of address translation. You might find it helpful to write out all the bits in the 
addresses and then draw boxes around the different bit fields such as VPN, TLBI,
and so on. In this particular problem, there are no misses of any kind: the TLB
has a copy of the PTE and the cache has a copy of the requested data words. See
Problems 9.11, 9.12, and 9.13 for some different combinations of hits and misses.
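
The bit-field bookkeeping can also be expressed as shifts and masks. The sketch below assumes the field widths of the example memory system that the problem refers to (14-bit virtual addresses, 64-byte pages, a TLB with 4 sets, and a cache with 16 sets of 4-byte blocks); those widths, and all of the function names, are assumptions stated here rather than taken from this section.

```c
/* Assumed field widths: 6 page-offset bits, 2 TLB index bits,
   2 cache block-offset bits, 4 cache index bits. */
enum { VPO_BITS = 6, TLBI_BITS = 2, CO_BITS = 2, CI_BITS = 4 };

unsigned vpn(unsigned va)       { return va >> VPO_BITS; }
unsigned tlb_index(unsigned va) { return vpn(va) & ((1u << TLBI_BITS) - 1); }
unsigned tlb_tag(unsigned va)   { return vpn(va) >> TLBI_BITS; }

/* Physical address: the PPN replaces the VPN; the page offset is unchanged. */
unsigned phys_addr(unsigned ppn, unsigned va)
{
    return (ppn << VPO_BITS) | (va & ((1u << VPO_BITS) - 1));
}

unsigned cache_offset(unsigned pa) { return pa & ((1u << CO_BITS) - 1); }
unsigned cache_index(unsigned pa)  { return (pa >> CO_BITS) & ((1u << CI_BITS) - 1); }
unsigned cache_tag(unsigned pa)    { return pa >> (CO_BITS + CI_BITS); }
```

Applied to the bit pattern 00 0011 1101 0111 (0x03d7) with PPN 0xd, these helpers give VPN 0xf, TLB index 0x3, TLB tag 0x3, physical address 0x357, byte offset 0x3, cache index 0x5, and cache tag 0xd. Whether the TLB and cache actually hit depends on their contents, which the helpers do not model.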


A. 00 0011 1101 0111

B. Parameter              Value
   VPN                    0xf
   TLB index              0x3
   TLB tag                0x3
   TLB hit? (Y/N)         Y
   Page fault? (Y/N)      N
   PPN                    0xd

C. 0011 0101 0111

D. Parameter              Value
   Byte offset            0x3
   Cache index            0x5
   Cache tag              0xd
   Cache hit? (Y/N)       Y
   Cache byte returned    0x1d


Solution to Problem 9.5 (page 839) 
Solving this problem will give you a good feel for the idea of memory mapping. 
Try it yourself. We haven't discussed the open, fstat, or write functions, so you'll 
need to read their man pages to see how they work.

code/vm/mmapcopy.c
    #include "csapp.h"

    /*
     * mmapcopy - uses mmap to copy file fd to stdout
     */
    void mmapcopy(int fd, int size)
    {
        char *bufp; /* ptr to memory-mapped VM area */

        bufp = Mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        Write(1, bufp, size);
        return;
    }

    /* mmapcopy driver */
    int main(int argc, char **argv)
    {
        struct stat stat;
        int fd;

        /* Check for required command-line argument */
        if (argc != 2) {
            printf("usage: %s <filename>\n", argv[0]);
            exit(0);
        }

        /* Copy the input argument to stdout */
        fd = Open(argv[1], O_RDONLY, 0);
        fstat(fd, &stat);
        mmapcopy(fd, stat.st_size);
        exit(0);
    }
code/vm/mmapcopy.c





Solution to Problem 9.6 (page 849)
This problem touches on some core ideas such as alignment requirements,
minimum block sizes, and header encodings. The general approach for determining
the block size is to round the sum of the requested payload and the header size
up to the nearest multiple of the alignment requirement (in this case, 8 bytes). For
example, the block size for the malloc(1) request is 4 + 1 = 5 rounded up to 8.
The block size for the malloc(13) request is 13 + 4 = 17 rounded up to 24.
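
The rounding rule can be written as a short calculation. This is a sketch under the assumptions of this problem (a 4-byte header, 8-byte alignment, and an allocated bit stored in the low-order bit of the header); the macro and function names are invented for the example.

```c
#include <stddef.h>

#define HEADER_SIZE 4   /* bytes; assumed from the block format in the problem */
#define ALIGNMENT   8   /* double-word alignment */

/* Payload plus header, rounded up to a multiple of ALIGNMENT. */
size_t block_size(size_t payload)
{
    return (payload + HEADER_SIZE + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
}

/* Header word for an allocated block: the size with the low (allocated) bit set. */
unsigned block_header(size_t payload)
{
    return (unsigned)block_size(payload) | 0x1;
}
```

For malloc(13) this gives a block size of 24 and a header of 0x19, since 24 is 0x18 and the allocated bit adds 1.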

    Request       Block size (decimal bytes)    Block header (hex)
    malloc(1)     8                             0x9
    malloc(5)     16                            0x11
    malloc(12)    16                            0x11
    malloc(13)    24                            0x19


Solution to Problem 9.7 (page 852)
The minimum block size can have a significant effect on internal fragmentation.
Thus, it is good to understand the minimum block sizes associated with different
allocator designs and alignment requirements. The tricky part is to realize that the
same block can be allocated or free at different points in time. Thus, the minimum
block size is the maximum of the minimum allocated block size and the minimum
free block size. For example, in the last subproblem, the minimum allocated block
size is a 4-byte header and a 1-byte payload rounded up to 8 bytes. The minimum
free block size is a 4-byte header and 4-byte footer, which is already a multiple of
8 and doesn't need to be rounded. So the minimum block size for this allocator is
8 bytes.

    Alignment      Allocated block          Free block           Minimum block size (bytes)
    Single word    Header and footer        Header and footer    12
    Single word    Header, but no footer    Header and footer    8
    Double word    Header and footer        Header and footer    16
    Double word    Header, but no footer    Header and footer    8


Solution to Problem 9.8 (page 861)

There is nothing very tricky here. But the solution requires you to understand 
how the rest of our simple implicit-list allocator works and how to manipulate 
and traverse blocks. 


code/vm/malloc/mm.c
    static void *find_fit(size_t asize)
    {
        /* First-fit search */
        void *bp;

        for (bp = heap_listp; GET_SIZE(HDRP(bp)) > 0; bp = NEXT_BLKP(bp)) {
            if (!GET_ALLOC(HDRP(bp)) && (asize <= GET_SIZE(HDRP(bp)))) {
                return bp;
            }
        }
        return NULL; /* No fit */
    }
code/vm/malloc/mm.c


Solution to Problem 9.9 (page 861) 

This is another warm-up exercise to help you become familiar with allocators.
Notice that for this allocator the minimum block size is 16 bytes. If the remainder
of the block after splitting would be greater than or equal to the minimum block
size, then we go ahead and split the block (lines 6-10). The only tricky part here
is to realize that you need to place the new allocated block (lines 6 and 7) before
moving to the next block (line 8).


code/vm/malloc/mm.c
    1   static void place(void *bp, size_t asize)
    2   {
    3       size_t csize = GET_SIZE(HDRP(bp));
    4
    5       if ((csize - asize) >= (2*DSIZE)) {
    6           PUT(HDRP(bp), PACK(asize, 1));
    7           PUT(FTRP(bp), PACK(asize, 1));
    8           bp = NEXT_BLKP(bp);
    9           PUT(HDRP(bp), PACK(csize-asize, 0));
    10          PUT(FTRP(bp), PACK(csize-asize, 0));
    11      }
    12      else {
    13          PUT(HDRP(bp), PACK(csize, 1));
    14          PUT(FTRP(bp), PACK(csize, 1));
    15      }
    16  }
code/vm/malloc/mm.c
Solution to Problem 9.10 (page 864)

Here is one pattern that will cause external fragmentation: The application makes
numerous allocation and free requests to the first size class, followed by
numerous allocation and free requests to the second size class, followed by numerous
allocation and free requests to the third size class, and so on. For each size class,
the allocator creates a lot of memory that is never reclaimed because the allocator
doesn't coalesce, and because the application never requests blocks from that size
class again.
Interaction and Communication between Programs


To this point in our study of computer systems, we have assumed that programs
run in isolation, with minimal input and output. However, in the real
world, application programs use services provided by the operating system
to communicate with I/O devices and with other programs.

This part of the book will give you an understanding of the basic I/O services
provided by Unix operating systems and how to use these services to build
applications such as Web clients and servers that communicate with each other over
the Internet. You will learn techniques for writing concurrent programs, such as
Web servers that can service multiple clients at the same time. Writing concurrent
application programs can also allow them to execute faster on modern multi-core
processors. When you finish this part, you will be well on your way to becoming a
power programmer with a mature understanding of computer systems and their
impact on your programs.
Input/output (I/O) is the process of copying data between main memory and
external devices such as disk drives, terminals, and networks. An input operation
copies data from an I/O device to main memory, and an output operation copies 
data from memory to a device. 

All language run-time systems provide higher-level facilities for performing 
I/O. For example, ANSI C provides the standard I/O library, with functions such as 
printf and scanf that perform buffered I/O. The C++ language provides similar 
functionality with its overloaded << (“put to”) and >> (“get from”) operators. On
Linux systems, these higher-level I/O functions are implemented using system- 
level Unix I/O functions provided by the kernel. Most of the time, the higher-level 
I/O functions work quite well and there is no need to use Unix I/O directly. So 
why bother learning about Unix I/O?


* Understanding Unix I/O will help you understand other systems concepts. I/O is
integral to the operation of a system, and because of this, we often encounter 
circular dependencies between I/O and other systems ideas. For example, 
I/O plays a key role in process creation and execution. Conversely, process 
creation plays a key role in how files are shared by different processes. Thus, 
to really understand I/O, you need to understand processes, and vice versa. 
We have already touched on aspects of I/O in our discussions of the memory 
hierarchy, linking and loading, processes, and virtual memory. Now that you 
have a better understanding of these ideas, we can close the circle and delve 
into I/O in more detail.


* Sometimes you have no choice but to use Unix I/O. There are some important
cases where using higher-level I/O functions is either impossible or
inappropriate. For example, the standard I/O library provides no way to access
file metadata such as file size or file creation time. Further, there are
problems with the standard I/O library that make it risky to use for network
programming.


This chapter introduces you to the general concepts of Unix I/O and standard 
I/O and shows you how to use them reliably from your C programs. Besides serving 
as a general introduction, this chapter lays a firm foundation for our subsequent
study of network programming and concurrency. 


10.1 Unix I/O

A Linux file is a sequence of m bytes:

$B_0, B_1, \ldots, B_k, \ldots, B_{m-1}$


All I/O devices, such as networks, disks, and terminals, are modeled as files, and 
all input and output is performed by reading and writing the appropriate files. This 
elegant mapping of devices to files allows the Linux kernel to export a simple, low- 
level application interface, known as Unix I/O, that enables all input and output 
to be performed in a uniform and consistent way:

Opening files. An application announces its intention to access an I/O device
by asking the kernel to open the corresponding file. The kernel returns
a small nonnegative integer, called a descriptor, that identifies the file
in all subsequent operations on the file. The kernel keeps track of all
information about the open file. The application only keeps track of the
descriptor.

Each process created by a Linux shell begins life with three open files: 
standard input (descriptor 0), standard output (descriptor 1), and standard 
error (descriptor 2). The header file <unistd.h> defines constants
STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO, which can be used instead
of the explicit descriptor values.


Changing the current file position. The kernel maintains a file position k, initially 
0, for each open file. The file position is a byte offset from the beginning 
of a file. An application can set the current file position k explicitly by 
performing a seek operation.


Reading and writing files. A read operation copies n > 0 bytes from a file to
memory, starting at the current file position k and then incrementing k
by n. Given a file with a size of m bytes, performing a read operation
when k ≥ m triggers a condition known as end-of-file (EOF), which can
be detected by the application. There is no explicit “EOF character” at
the end of a file.

Similarly, a write operation copies n > 0 bytes from memory to a file,
starting at the current file position k and then updating k.


Closing files. When an application has finished accessing a file, it informs the
kernel by asking it to close the file. The kernel responds by freeing
the data structures it created when the file was opened and restoring the
descriptor to a pool of available descriptors. When a process terminates
for any reason, the kernel closes all open files and frees their memory
resources.


10.2 Files 


Each Linux file has a type that indicates its role in the system: 


* A regular file contains arbitrary data. Application programs often distinguish 
between text files, which are regular files that contain only ASCII or Unicode 
characters, and binary files, which are everything else. To the kernel there is 
no difference between text and binary files.

A Linux text file consists of a sequence of text lines, where each line is a 
sequence of characters terminated by a newline character (‘\n’). The newline 
character is the same as the ASCII line feed character (LF) and has a numeric 
value of 0x0a.


Aside  End-of-line (EOL) indicators

One of the clumsy aspects of working with text files is that different systems
use different characters to mark the end of a line. Linux and Mac OS X use
‘\n’ (0xa), which is the ASCII line feed (LF) character. However, MS Windows
and Internet protocols such as HTTP use the sequence ‘\r\n’ (0xd 0xa), which
is the ASCII carriage return (CR) character followed by a line feed (LF). If
you create a file foo.txt in Windows and then view it in a Linux text editor,
you'll see an annoying ^M at the end of each line, which is how Linux tools
display the CR character. You can remove these unwanted CR characters from
foo.txt in place by issuing the following command:

linux> perl -pi -e "s/\r\n/\n/g" foo.txt

* A directory is a file consisting of an array of links, where each link maps a
filename to a file, which may be another directory. Each directory contains at
least two entries: . (dot) is a link to the directory itself, and .. (dot-dot) is
a link to the parent directory in the directory hierarchy (see below). You can
create a directory with the mkdir command, view its contents with ls, and
delete it with rmdir.

* A socket is a file that is used to communicate with another process across a
network (Section 11.4).


Other file types include named pipes, symbolic links, and character and block 
devices, which are beyond our scope. 

The Linux kernel organizes all files in a single directory hierarchy anchored
by the root directory named / (slash). Each file in the system is a direct or
indirect descendant of the root directory. Figure 10.1 shows a portion of the
directory hierarchy on our Linux system.

As part of its context, each process has a current working directory that
identifies its current location in the directory hierarchy. You can change the
shell's current working directory with the cd command.














/
    bin/
        bash
    dev/
        tty1
    etc/
        group
        passwd/
    home/
        droh/
            hello.c
        bryant/
    usr/
        include/
            stdio.h
            sys/
                unistd.h
        bin/
            vim

Figure 10.1 Portion of the Linux directory hierarchy. A trailing slash denotes a
directory.

Locations in the directory hierarchy are specified by pathnames. A pathname
is a string consisting of an optional slash followed by a sequence of filenames
separated by slashes. Pathnames have two forms:

* An absolute pathname starts with a slash and denotes a path from the root
node. For example, in Figure 10.1, the absolute pathname for hello.c is
/home/droh/hello.c.

* A relative pathname starts with a filename and denotes a path from the current
working directory. For example, in Figure 10.1, if /home/droh is the current
working directory, then the relative pathname for hello.c is ./hello.c. On
the other hand, if /home/bryant is the current working directory, then the
relative pathname is ../droh/hello.c.


10.3 Opening and Closing Files 


A process opens an existing file or creates a new file by calling the open function.

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int open(char *filename, int flags, mode_t mode);
    Returns: new file descriptor if OK, -1 on error
The open function converts a filename to a file descriptor and returns the de- 
scriptor number. The descriptor returned is always the smallest descriptor that is 
not currently open in the process. The flags argument indicates how the process
intends to access the file:

O_RDONLY. Reading only
O_WRONLY. Writing only
O_RDWR. Reading and writing


For example, here is how to open an existing file for reading: 


fd = Open("foo.txt", O_RDONLY, 0);


The flags argument can also be ORed with one or more bit masks that provide
additional instructions for writing:

O_CREAT. If the file doesn't exist, then create a truncated (empty) version
of it.

O_TRUNC. If the file already exists, then truncate it.

O_APPEND. Before each write operation, set the file position to the end of
the file.

Mask       Description

S_IRUSR    User (owner) can read this file
S_IWUSR    User (owner) can write this file
S_IXUSR    User (owner) can execute this file

S_IRGRP    Members of the owner's group can read this file
S_IWGRP    Members of the owner's group can write this file
S_IXGRP    Members of the owner's group can execute this file

S_IROTH    Others (anyone) can read this file
S_IWOTH    Others (anyone) can write this file
S_IXOTH    Others (anyone) can execute this file

Figure 10.2 Access permission bits. Defined in sys/stat.h.


For example, here is how you might open an existing file with the intent of
appending some data:

fd = Open("foo.txt", O_WRONLY|O_APPEND, 0);


The mode argument specifies the access permission bits of new files. The 
symbolic names for these bits are shown in Figure 10.2. 

As part of its context, each process has a umask that is set by calling the
umask function. When a process creates a new file by calling the open function
with some mode argument, then the access permission bits of the file are set to
mode & ~umask. For example, suppose we are given the following default values
for mode and umask:

#define DEF_MODE   S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH
#define DEF_UMASK  S_IWGRP|S_IWOTH


Then the following code fragment creates a new file in which the owner of the file
has read and write permissions, and all other users have read permissions:

umask(DEF_UMASK);
fd = Open("foo.txt", O_CREAT|O_TRUNC|O_WRONLY, DEF_MODE);


Finally, a process closes an open file by calling the close function.

#include <unistd.h>

int close(int fd);
    Returns: 0 if OK, -1 on error

Closing a descriptor that is already closed is an error.





Practice Problem

What is the output of the following program?

#include "csapp.h"

int main()
{
    int fd1, fd2;

    fd1 = Open("foo.txt", O_RDONLY, 0);
    Close(fd1);
    fd2 = Open("baz.txt", O_RDONLY, 0);
    printf("fd2 = %d\n", fd2);
    exit(0);
}


10.4 Reading and Writing Files 


Applications perform input and output by calling the read and write functions,
respectively.

#include <unistd.h>

ssize_t read(int fd, void *buf, size_t n);
    Returns: number of bytes read if OK, 0 on EOF, -1 on error

ssize_t write(int fd, const void *buf, size_t n);
    Returns: number of bytes written if OK, -1 on error


The read function copies at most n bytes from the current file position of
descriptor fd to memory location buf. A return value of -1 indicates an error,
and a return value of 0 indicates EOF. Otherwise, the return value indicates
the number of bytes that were actually transferred.

The write function copies at most n bytes from memory location buf to the
current file position of descriptor fd. Figure 10.3 shows a program that uses read
and write calls to copy the standard input to the standard output, 1 byte at a time.

Applications can explicitly modify the current file position by calling the
lseek function, which is beyond our scope.

In some situations, read and write transfer fewer bytes than the application
requests. Such short counts do not indicate an error. They occur for a number of 
reasons: 


code/io/cpstdin.c
#include "csapp.h"

int main(void)
{
    char c;

    while (Read(STDIN_FILENO, &c, 1) != 0)
        Write(STDOUT_FILENO, &c, 1);
    exit(0);
}
code/io/cpstdin.c

Figure 10.3 Using read and write to copy standard input to standard output 1 byte
at a time.


Encountering EOF on reads. Suppose that we are ready to read from a file that
contains only 20 more bytes from the current file position and that we are
reading the file in 50-byte chunks. Then the next read will return a short
count of 20, and the read after that will signal EOF by returning a short
count of 0.

Reading text lines from a terminal. If the open file is associated with a terminal
(i.e., a keyboard and display), then each read function will transfer one
text line at a time, returning a short count equal to the size of the text line.

Reading and writing network sockets. If the open file corresponds to a network
socket (Section 11.4), then internal buffering constraints and long network
delays can cause read and write to return short counts. Short counts
can also occur when you call read and write on a Linux pipe, an
interprocess communication mechanism that is beyond our scope.
In practice, you will never encounter short counts when you read from disk
files except on EOF, and you will never encounter short counts when you write
to disk files. However, if you want to build robust (reliable) network applications
such as Web servers, then you must deal with short counts by repeatedly calling
read and write until all requested bytes have been transferred.


10.5 Robust Reading and Writing with the Rio Package 


In this section, we will develop an I/O package, called the Rio (Robust I/O)
package, that handles these short counts for you automatically. The Rio package
provides convenient, robust, and efficient I/O in applications such as network
programs that are subject to short counts. Rio provides two different kinds of
functions:

Unbuffered input and output functions. These functions transfer data directly
between memory and a file, with no application-level buffering. They are
especially useful for reading and writing binary data to and from networks.

Buffered input functions. These functions allow you to efficiently read text lines
and binary data from a file whose contents are cached in an application-
level buffer, similar to the one provided for standard I/O functions such as
printf. Unlike the buffered I/O routines presented in [110], the buffered
Rio input functions are thread-safe (Section 12.7.1) and can be inter-
leaved arbitrarily on the same descriptor. For example, you can read some
text lines from a descriptor, then some binary data, and then some more
text lines.


We are presenting the Rio routines for two reasons. First, we will be using 
them in the network applications we develop in the next two chapters. Second, by
studying the code for these routines, you will gain a deeper understanding of Unix 
I/O in general. 


10.5.1 Rio Unbuffered Input and Output Functions 

Applications can transfer data directly between memory and a file by calling the
rio_readn and rio_writen functions.

#include "csapp.h"

ssize_t rio_readn(int fd, void *usrbuf, size_t n);
ssize_t rio_writen(int fd, void *usrbuf, size_t n);
    Returns: number of bytes transferred if OK, 0 on EOF (rio_readn only), -1 on error

The rio_readn function transfers up to n bytes from the current file position
of descriptor fd to memory location usrbuf. Similarly, the rio_writen function
transfers n bytes from location usrbuf to descriptor fd. The rio_readn function
can only return a short count if it encounters EOF. The rio_writen function never
returns a short count. Calls to rio_readn and rio_writen can be interleaved
arbitrarily on the same descriptor.


Figure 10.4 shows the code for rio_readn and rio_writen. Notice that each
function manually restarts the read or write function if it is interrupted by the
return from an application signal handler. To be as portable as possible, we allow
for interrupted system calls and restart them when necessary.


10.5.2 Rio Buffered Input Functions 


Suppose we wanted to write a program that counts the number of lines in a text file.
How might we do this? One approach is to use the read function to transfer 1 byte
at a time from the file to the user's memory, checking each byte for the newline
character. The disadvantage of this approach is that it is inefficient, requiring a
trap to the kernel to read each byte in the file.

A better approach is to call a wrapper function (rio_readlineb) that copies
the text line from an internal read buffer, automatically making a read call to refill
the buffer whenever it becomes empty. For files that contain both text lines and
binary data (such as the HTTP responses described in Section 11.5.3), we also
provide a buffered version of rio_readn, called rio_readnb, that transfers raw
bytes from the same read buffer as rio_readlineb.


#include "csapp.h"

void rio_readinitb(rio_t *rp, int fd);
    Returns: nothing

ssize_t rio_readlineb(rio_t *rp, void *usrbuf, size_t maxlen);
ssize_t rio_readnb(rio_t *rp, void *usrbuf, size_t n);
    Returns: number of bytes read if OK, 0 on EOF, -1 on error





The rio_readinitb function is called once per open descriptor. It associates the
descriptor fd with a read buffer of type rio_t at address rp.

The rio_readlineb function reads the next text line from file rp (including
the terminating newline character), copies it to memory location usrbuf, and
terminates the text line with the NULL (zero) character. The rio_readlineb
function reads at most maxlen-1 bytes, leaving room for the terminating NULL
character. Text lines that exceed maxlen-1 bytes are truncated and terminated
with a NULL character.

The rio_readnb function reads up to n bytes from file rp to memory location
usrbuf. Calls to rio_readlineb and rio_readnb can be interleaved arbitrarily
on the same descriptor. However, calls to these buffered functions should not be
interleaved with calls to the unbuffered rio_readn function.

You will encounter numerous examples of the Rio functions in the remainder
of this text. Figure 10.5 shows how to use the Rio functions to copy a text file from
standard input to standard output, one line at a time.

Figure 10.6 shows the format of a read buffer, along with the code for the
rio_readinitb function that initializes it. The rio_readinitb function sets up
an empty read buffer and associates an open file descriptor with that buffer.























code/src/csapp.c
ssize_t rio_readn(int fd, void *usrbuf, size_t n)
{
    size_t nleft = n;
    ssize_t nread;
    char *bufp = usrbuf;

    while (nleft > 0) {
        if ((nread = read(fd, bufp, nleft)) < 0) {
            if (errno == EINTR) /* Interrupted by sig handler return */
                nread = 0;      /* and call read() again */
            else
                return -1;      /* errno set by read() */
        }
        else if (nread == 0)
            break;              /* EOF */
        nleft -= nread;
        bufp += nread;
    }
    return (n - nleft);         /* Return >= 0 */
}
code/src/csapp.c

code/src/csapp.c
ssize_t rio_writen(int fd, void *usrbuf, size_t n)
{
    size_t nleft = n;
    ssize_t nwritten;
    char *bufp = usrbuf;

    while (nleft > 0) {
        if ((nwritten = write(fd, bufp, nleft)) <= 0) {
            if (errno == EINTR)  /* Interrupted by sig handler return */
                nwritten = 0;    /* and call write() again */
            else
                return -1;       /* errno set by write() */
        }
        nleft -= nwritten;
        bufp += nwritten;
    }
    return n;
}
code/src/csapp.c

Figure 10.4 The rio_readn and rio_writen functions.

code/io/cpfile.c
#include "csapp.h"

int main(int argc, char **argv)
{
    int n;
    rio_t rio;
    char buf[MAXLINE];

    Rio_readinitb(&rio, STDIN_FILENO);
    while ((n = Rio_readlineb(&rio, buf, MAXLINE)) != 0)
        Rio_writen(STDOUT_FILENO, buf, n);
}
code/io/cpfile.c

Figure 10.5 Copying a text file from standard input to standard output.

code/include/csapp.h
#define RIO_BUFSIZE 8192
typedef struct {
    int rio_fd;                /* Descriptor for this internal buf */
    int rio_cnt;               /* Unread bytes in internal buf */
    char *rio_bufptr;          /* Next unread byte in internal buf */
    char rio_buf[RIO_BUFSIZE]; /* Internal buffer */
} rio_t;
code/include/csapp.h

code/src/csapp.c
void rio_readinitb(rio_t *rp, int fd)
{
    rp->rio_fd = fd;
    rp->rio_cnt = 0;
    rp->rio_bufptr = rp->rio_buf;
}
code/src/csapp.c

Figure 10.6 A read buffer of type rio_t and the rio_readinitb function that initializes it.


The heart of the Rio read routines is the rio_read function shown in
Figure 10.7. The rio_read function is a buffered version of the Linux read function.
When rio_read is called with a request to read n bytes, there are rp->rio_cnt
unread bytes in the read buffer. If the buffer is empty, then it is replenished with
a call to read. Receiving a short count from this invocation of read is not an
error; it simply has the effect of partially filling the read buffer. Once the buffer is
nonempty, rio_read copies the minimum of n and rp->rio_cnt bytes from the
read buffer to the user buffer and returns the number of bytes copied.

code/src/csapp.c
static ssize_t rio_read(rio_t *rp, char *usrbuf, size_t n)
{
    int cnt;

    while (rp->rio_cnt <= 0) {  /* Refill if buf is empty */
        rp->rio_cnt = read(rp->rio_fd, rp->rio_buf,
                           sizeof(rp->rio_buf));
        if (rp->rio_cnt < 0) {
            if (errno != EINTR) /* Interrupted by sig handler return */
                return -1;
        }
        else if (rp->rio_cnt == 0)  /* EOF */
            return 0;
        else
            rp->rio_bufptr = rp->rio_buf; /* Reset buffer ptr */
    }

    /* Copy min(n, rp->rio_cnt) bytes from internal buf to user buf */
    cnt = n;
    if (rp->rio_cnt < n)
        cnt = rp->rio_cnt;
    memcpy(usrbuf, rp->rio_bufptr, cnt);
    rp->rio_bufptr += cnt;
    rp->rio_cnt -= cnt;
    return cnt;
}
code/src/csapp.c

Figure 10.7 The internal rio_read function.

To an application program, the rio_read function has the same semantics as
the Linux read function. On error, it returns -1 and sets errno appropriately. On
EOF, it returns 0. It returns a short count if the number of requested bytes exceeds
the number of unread bytes in the read buffer. The similarity of the two functions
makes it easy to build different kinds of buffered read functions by substituting
rio_read for read. For example, the rio_readnb function in Figure 10.8 has the
same structure as rio_readn, with rio_read substituted for read. Similarly, the
rio_readlineb routine in Figure 10.8 calls rio_read at most maxlen-1 times.
Each call returns 1 byte from the read buffer, which is then checked for being the
terminating newline.

code/src/csapp.c


ssize_t rio_readlineb(rio_t *rp, void *usrbuf, size_t maxlen)
{
    int n, rc;
    char c, *bufp = usrbuf;

    for (n = 1; n < maxlen; n++) {
        if ((rc = rio_read(rp, &c, 1)) == 1) {
            *bufp++ = c;
            if (c == '\n') {
                n++;
                break;
            }
        } else if (rc == 0) {
            if (n == 1)
                return 0; /* EOF, no data read */
            else
                break;    /* EOF, some data was read */
        } else
            return -1;    /* Error */
    }
    *bufp = 0;
    return n-1;
}
code/src/csapp.c

code/src/csapp.c
ssize_t rio_readnb(rio_t *rp, void *usrbuf, size_t n)
{
    size_t nleft = n;
    ssize_t nread;
    char *bufp = usrbuf;

    while (nleft > 0) {
        if ((nread = rio_read(rp, bufp, nleft)) < 0)
            return -1;          /* errno set by read() */
        else if (nread == 0)
            break;              /* EOF */
        nleft -= nread;
        bufp += nread;
    }
    return (n - nleft);         /* Return >= 0 */
}
code/src/csapp.c

Figure 10.8 The rio_readlineb and rio_readnb functions.


Aside  Origins of the Rio package

The Rio functions are inspired by the readline, readn, and writen functions
described by W. Richard Stevens in his classic network programming text [110].
The rio_readn and rio_writen functions are identical to the Stevens readn and
writen functions. However, the Stevens readline function has some limitations
that are corrected in Rio. First, because readline is buffered and readn is
not, these two functions cannot be used together on the same descriptor.
Second, because it uses a static buffer, the Stevens readline function is not
thread-safe, which required Stevens to introduce a different thread-safe
version called readline_r. We have corrected both of these flaws with the
rio_readlineb and rio_readnb functions, which are mutually compatible and
thread-safe.


10.6 Reading File Metadata


An application can retrieve information about a file (sometimes called the file's 
metadata) by calling the stat and fstat functions. 


#include <unistd.h> 
#include <sys/stat.h>


int stat(const char *filename, struct stat *buf); 
int fstat(int fd, struct stat *buf); 
Returns: 0 if OK, —1 on error 





The stat function takes as input a filename and fills in the members of a stat
structure shown in Figure 10.9. The fstat function is similar, but it takes a file
descriptor instead of a filename. We will need the st_mode and st_size members
of the stat structure when we discuss Web servers in Section 11.5. The other
members are beyond our scope.

The st_size member contains the file size in bytes. The st_mode member
encodes both the file permission bits (Figure 10.2) and the file type (Section 10.2).
Linux defines macro predicates in sys/stat.h for determining the file type from
the st_mode member:

S_ISREG(m). Is this a regular file?
S_ISDIR(m). Is this a directory file?
S_ISSOCK(m). Is this a network socket?

Figure 10.10 shows how we might use these macros and the stat function to read
and interpret a file's st_mode bits.

statbuf.h (included by sys/stat.h)


/* Metadata returned by the stat and fstat functions */
struct stat {
    dev_t         st_dev;      /* Device */
    ino_t         st_ino;      /* inode */
    mode_t        st_mode;     /* Protection and file type */
    nlink_t       st_nlink;    /* Number of hard links */
    uid_t         st_uid;      /* User ID of owner */
    gid_t         st_gid;      /* Group ID of owner */
    dev_t         st_rdev;     /* Device type (if inode device) */
    off_t         st_size;     /* Total size, in bytes */
    unsigned long st_blksize;  /* Block size for filesystem I/O */
    unsigned long st_blocks;   /* Number of blocks allocated */
    time_t        st_atime;    /* Time of last access */
    time_t        st_mtime;    /* Time of last modification */
    time_t        st_ctime;    /* Time of last change */
};
statbuf.h (included by sys/stat.h)

Figure 10.9 The stat structure.


code/io/statcheck.c
#include "csapp.h"

int main (int argc, char **argv)
{
    struct stat stat;
    char *type, *readok;

    Stat(argv[1], &stat);
    if (S_ISREG(stat.st_mode))     /* Determine file type */
        type = "regular";
    else if (S_ISDIR(stat.st_mode))
        type = "directory";
    else
        type = "other";
    if ((stat.st_mode & S_IRUSR)) /* Check read access */
        readok = "yes";
    else
        readok = "no";

    printf("type: %s, read: %s\n", type, readok);
    exit(0);
}
code/io/statcheck.c

Figure 10.10 Querying and manipulating a file's st_mode bits.

10.7 Reading Directory Contents


Applications can read the contents of a directory with the readdir family of
functions.


#include <sys/types.h> 
#include <dirent.h> 


DIR *opendir(const char *name);
Returns: pointer to handle if OK, NULL on error 





The opendir function takes a pathname and returns a pointer to a directory stream.
A stream is an abstraction for an ordered list of items, in this case a list of directory
entries.


#include <dirent.h> 


struct dirent *readdir(DIR *dirp);
    Returns: pointer to next directory entry if OK, NULL if no more entries or error


Each call to readdir returns a pointer to the next directory entry in the stream 
dirp, or NULL if there are no more entries. Each directory entry is a structure of 
the form 


struct dirent {
    ino_t d_ino;       /* inode number */
    char  d_name[256]; /* Filename */
};

Although some versions of Linux include other structure members, these
are the only two that are standard across all systems. The d_name member is the
filename, and d_ino is the file location.

On error, readdir returns NULL and sets errno. Unfortunately, the only way
to distinguish an error from the end-of-stream condition is to check if errno has
been modified since the call to readdir.


#include <dirent.h>

int closedir(DIR *dirp);
    Returns: 0 on success, -1 on error

The closedir function closes the stream and frees up any of its resources.
Figure 10.11 shows how we might use readdir to read the contents of a directory.

code/io/readdir.c


#include "csapp.h"

int main(int argc, char **argv)
{
    DIR *streamp;
    struct dirent *dep;

    streamp = Opendir(argv[1]);

    errno = 0;
    while ((dep = readdir(streamp)) != NULL) {
        printf("Found file: %s\n", dep->d_name);
    }
    if (errno != 0)
        unix_error("readdir error");

    Closedir(streamp);
    exit(0);
}
code/io/readdir.c

Figure 10.11 Reading the contents of a directory.

10.8 Sharing Files 
1 t 


Linux files can be shared in a number of different ways. Unless you have a clear
picture of how the kernel represents open files, the idea of file sharing can be quite
confusing. The kernel represents open files using three related data structures:

Descriptor table. Each process has its own separate descriptor table whose
entries are indexed by the process's open file descriptors. Each open
descriptor entry points to an entry in the file table.


File table. The set of open files is represented by a file table that is shared by all
processes. Each file table entry consists of (for our purposes) the current
file position, a reference count of the number of descriptor entries that
currently point to it, and a pointer to an entry in the v-node table. Closing
a descriptor decrements the reference count in the associated file table
entry. The kernel will not delete the file table entry until its reference
count is zero.


v-node table. Like the file table, the v-node table is shared by all processes. Each
entry contains most of the information in the stat structure, including the
st_mode and st_size members.





Figure 10.12 Typical kernel data structures for open files. In this example,
two descriptors reference distinct files. There is no sharing.

Figure 10.13 File sharing. This example shows two descriptors sharing the
same disk file through two open file table entries.





Figure 10.12 shows an example where descriptors 1 and 4 reference two 
different files through distinct open file table entries. This is the typical situation, 
where files are not shared and where each descriptor corresponds to a distinct file. 

Multiple descriptors can also reference the same file through different file 
table entries, as shown in Figure 10.13. This might happen, for example, if you 
were to call the open function twice with the same filename. The key idea is that 
each descriptor has its own distinct file position, so different reads on different
descriptors can fetch data from different locations in the file.
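
The independence of the two file positions can be observed directly with lseek. The following is a minimal sketch under stated assumptions: the function name second_offset_after_read is hypothetical, and it uses the raw Unix I/O calls rather than the csapp.h wrappers.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open `path` twice, read n bytes through the first
 * descriptor only, and report the file position of the second descriptor.
 * Returns -1 on any error. */
off_t second_offset_after_read(const char *path, size_t n)
{
    char buf[64];
    assert(path != NULL && n <= sizeof(buf));

    int fd1 = open(path, O_RDONLY);   /* first open file table entry */
    int fd2 = open(path, O_RDONLY);   /* second, independent entry */
    off_t pos = -1;

    if (fd1 >= 0 && fd2 >= 0 && read(fd1, buf, n) >= 0)
        pos = lseek(fd2, 0, SEEK_CUR);  /* query fd2's position without moving it */

    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return pos;
}
```

Because each call to open creates its own file table entry, reading through the first descriptor leaves the second descriptor's position at zero.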

We can also understand how parent and child processes share files. Suppose 
that before a call to fork, the parent process has the open files shown in Fig- 
ure 10.12. Then Figure 10.14 shows the situation after the call to fork. 

The child gets its own duplicate copy of the parent's descriptor table. Parent 
and child share the same set of open file tables and thus share the same file pos- 
ition. An important consequence is that the parent and child must both close their 
descriptors before the kernel will delete the corresponding file table entry. 
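
The shared file position can be checked in the same way. This is a minimal sketch, assuming a hypothetical helper named parent_offset_after_child_read that uses the raw fork, read, and lseek calls rather than the csapp.h wrappers.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: open `path`, fork a child that reads n bytes from the
 * inherited descriptor, and return the file position the parent observes
 * afterward. Returns -1 on error. */
off_t parent_offset_after_child_read(const char *path, size_t n)
{
    char buf[64];
    assert(path != NULL && n <= sizeof(buf));

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {                        /* child: same file table entry */
        ssize_t got = read(fd, buf, n);
        _exit(got < 0 ? 1 : 0);
    }

    waitpid(pid, NULL, 0);                 /* let the child finish its read */
    off_t pos = lseek(fd, 0, SEEK_CUR);    /* parent sees the advanced position */
    close(fd);
    return pos;
}
```

The parent never calls read, yet its position moves, because both descriptor tables point to the same file table entry.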
Figure 10.14 How a child process inherits the parent's open files. The initial
situation is in Figure 10.12.


Practice Problem 10.2
Suppose the disk file foobar.txt consists of the six ASCII characters foobar.
Then what is the output of the following program?

#include "csapp.h"

int main()
{
    int fd1, fd2;
    char c;

    fd1 = Open("foobar.txt", O_RDONLY, 0);
    fd2 = Open("foobar.txt", O_RDONLY, 0);
    Read(fd1, &c, 1);
    Read(fd2, &c, 1);
    printf("c = %c\n", c);
    exit(0);
}


Practice Problem 10.3
As before, suppose the disk file foobar.txt consists of the six ASCII characters
foobar. Then what is the output of the following program?

#include "csapp.h"

int main()
{
    int fd;
    char c;

    fd = Open("foobar.txt", O_RDONLY, 0);
    if (Fork() == 0) {
        Read(fd, &c, 1);
        exit(0);
    }
    Wait(NULL);
    Read(fd, &c, 1);
    printf("c = %c\n", c);
    exit(0);
}


10.9 I/O Redirection


Linux shells provide I/O redirection operators that allow users to associate
standard input and output with disk files. For example, typing

linux> ls > foo.txt

causes the shell to load and execute the ls program, with standard output
redirected to disk file foo.txt. As we will see in Section 11.5, a Web server performs
a similar kind of redirection when it runs a CGI program on behalf of the client.
So how does I/O redirection work? One way is to use the dup2 function.


#include <unistd.h>

int dup2(int oldfd, int newfd);
Returns: nonnegative descriptor if OK, -1 on error


The dup2 function copies descriptor table entry oldfd to descriptor table entry
newfd, overwriting the previous contents of descriptor table entry newfd. If newfd
was already open, then dup2 closes newfd before it copies oldfd.

Suppose that before calling dup2(4,1), we have the situation in Figure 10.12,
where descriptor 1 (standard output) corresponds to file A (say, a terminal)
and descriptor 4 corresponds to file B (say, a disk file). The reference counts
for A and B are both equal to 1. Figure 10.15 shows the situation after calling
dup2(4,1). Both descriptors now point to file B; file A has been closed and its
file table and v-node table entries deleted; and the reference count for file B has
been incremented. From this point on, any data written to standard output are
redirected to file B.
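
The redirection described above can be sketched as a small function. This is an illustration under stated assumptions, not the shell's actual code: the name write_via_redirected_stdout is hypothetical, and it additionally uses the POSIX dup function (not discussed in this section) to save descriptor 1 so that it can be restored afterward.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write msg to `path` by temporarily redirecting
 * descriptor 1 (standard output) with dup2. Returns 0 on success, -1 on
 * error. A short count from write is treated as an error for simplicity. */
int write_via_redirected_stdout(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);

    int filefd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (filefd < 0)
        return -1;

    int saved = dup(STDOUT_FILENO);       /* remember where stdout pointed */
    if (saved < 0) {
        close(filefd);
        return -1;
    }

    int rc = -1;
    if (dup2(filefd, STDOUT_FILENO) >= 0) {   /* descriptor 1 now refers to the file */
        size_t len = strlen(msg);
        if (write(STDOUT_FILENO, msg, len) == (ssize_t)len)
            rc = 0;
        if (dup2(saved, STDOUT_FILENO) < 0)   /* restore the original stdout */
            rc = -1;
    }

    close(saved);
    close(filefd);
    return rc;
}
```

After the first dup2 call, the file's table entry is referenced by two descriptors, so closing filefd alone would not have closed the file.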





Practice Problem 10.4
How would you use dup2 to redirect standard input to descriptor 5?
Aside  Right and left hoinkies
To avoid confusion with other bracket-type operators such as ‘]’ and ‘[’, we have always referred to
the shell’s ‘>’ operator as a “right hoinky” and the ‘<’ operator as a “left hoinky.”
Figure 10.15 Kernel data structures after redirecting standard output by
calling dup2(4,1). The initial situation is shown in Figure 10.12.















Practice Problem 10.5
Assuming that the disk file foobar.txt consists of the six ASCII characters
foobar, what is the output of the following program?

#include "csapp.h"

int main()
{
    int fd1, fd2;
    char c;

    fd1 = Open("foobar.txt", O_RDONLY, 0);
    fd2 = Open("foobar.txt", O_RDONLY, 0);
    Read(fd2, &c, 1);
    Dup2(fd2, fd1);
    Read(fd1, &c, 1);
    printf("c = %c\n", c);
    exit(0);
}
10.10 Standard I/O


The C language defines a set of higher-level input and output functions, called the
standard I/O library, that provides programmers with a higher-level alternative
to Unix I/O. The library (libc) provides functions for opening and closing files
(fopen and fclose), reading and writing bytes (fread and fwrite), reading and
writing strings (fgets and fputs), and sophisticated formatted I/O (scanf and
printf).

The standard I/O library models an open file as a stream. To the programmer, a 
stream is a pointer to a structure of type FILE. Every ANSI C program begins with 
three open streams, stdin, stdout, and stderr, which correspond to standard 
input, standard output, and standard error, respectively: 


#include <stdio.h>

extern FILE *stdin;  /* Standard input (descriptor 0) */
extern FILE *stdout; /* Standard output (descriptor 1) */
extern FILE *stderr; /* Standard error (descriptor 2) */

A stream of type FILE is an abstraction for a file descriptor and a stream
buffer. The purpose of the stream buffer is the same as the Rio read buffer: to
minimize the number of expensive Linux I/O system calls. For example, suppose
we have a program that makes repeated calls to the standard I/O getc function,
where each invocation returns the next character from a file. When getc is called
the first time, the library fills the stream buffer with a single call to the read function
and then returns the first byte in the buffer to the application. As long as there are
unread bytes in the buffer, subsequent calls to getc can be served directly from
the stream buffer.
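
As a minimal sketch of such a program, the following hypothetical function count_byte calls getc once per character. The code never calls read itself; the library refills the stream buffer behind the scenes whenever it runs empty.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: count how many times the byte `target` occurs in the
 * remainder of stream fp. */
long count_byte(FILE *fp, int target)
{
    assert(fp != NULL);

    long count = 0;
    int c;                          /* int rather than char, so EOF is distinguishable */
    while ((c = getc(fp)) != EOF)   /* usually served from the stream buffer */
        if (c == target)
            count++;
    return count;
}
```

Written with one-byte calls to read instead, the same loop would make one system call per character.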


10.11 Putting It Together: Which I/O Functions Should I Use?


Figure 10.16 summarizes the various I/O packages that we have discussed in this 
chapter. 


Figure 10.16 Relationship between Unix I/O, standard I/O, and RIO.
The Unix I/O model is implemented in the operating system kernel. It is
available to applications through functions such as open, close, lseek, read, write,
and stat. The higher-level Rio and standard I/O functions are implemented “on
top of” (using) the Unix I/O functions. The Rio functions are robust wrappers for
read and write that were developed specifically for this textbook. They
automatically deal with short counts and provide an efficient buffered approach for reading
text lines. The standard I/O functions provide a more complete buffered
alternative to the Unix I/O functions, including formatted I/O routines such as printf
and scanf.
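
To make the phrase "deal with short counts" concrete, here is a minimal sketch of the idea. It is an illustration only and not the Rio source code: the name read_fully is hypothetical, and the choice to retry after EINTR is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read until n bytes have arrived or EOF
 * is reached. Returns the number of bytes read (less than n only at EOF),
 * or -1 on error. */
ssize_t read_fully(int fd, void *buf, size_t n)
{
    assert(buf != NULL);

    char *p = buf;
    size_t left = n;
    while (left > 0) {
        ssize_t got = read(fd, p, left);
        if (got < 0) {
            if (errno == EINTR)     /* interrupted by a signal: try again */
                continue;
            return -1;
        }
        if (got == 0)               /* EOF */
            break;
        p += got;                   /* short count: advance and keep reading */
        left -= (size_t)got;
    }
    return (ssize_t)(n - left);
}
```

The caller sees a short result only at end-of-file, which is the behavior the text attributes to the Rio wrappers.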

So which of these functions should you use in your programs? Here are some
basic guidelines:


G1: Use the standard I/O functions whenever possible. The standard I/O func- 
tions are the method of choice for I/O on disk and terminal devices. Most 
C programmers use standard I/O exclusively throughout their careers, 
never bothering with the lower-level Unix I/O functions (except possibly
stat, which has no counterpart in the standard I/O library). Whenever 
possible, we recommend that you do likewise. 


G2: Don’t use scanf or rio_readlineb to read binary files. Functions like scanf
and rio_readlineb are designed specifically for reading text files. A
common error that students make is to use these functions to read binary
data, causing their programs to fail in strange and unpredictable ways.
For example, binary files might be littered with many 0xa bytes that have
nothing to do with terminating text lines.


G3: Use the Rio functions for I/O on network sockets. Unfortunately, standard
I/O poses some nasty problems when we attempt to use it for input and
output on networks. As we will see in Section 11.4, the Linux abstraction
for a network is a type of file called a socket. Like any Linux file,
sockets are referenced by file descriptors, known in this case as socket
descriptors. Application processes communicate with processes running on
other computers by reading and writing socket descriptors.


Standard I/O streams are full duplex in the sense that programs can perform 
input and output on the same stream. However, there are poorly documented 
restrictions on streams that interact badly with restrictions on sockets: 


Restriction 1: Input functions following output functions. An input function
cannot follow an output function without an intervening call to fflush,
fseek, fsetpos, or rewind. The fflush function empties the buffer
associated with a stream. The latter three functions use the Unix I/O lseek
function to reset the current file position.


Restriction 2: Output functions following input functions. An output function
cannot follow an input function without an intervening call to fseek,
fsetpos, or rewind, unless the input function encounters an end-of-file.
These restrictions pose a problem for network applications because it is illegal
to use the lseek function on a socket. The first restriction on stream I/O can be
worked around by adopting a discipline of flushing the buffer before every input
operation. However, the only way to work around the second restriction is to
open two streams on the same open socket descriptor, one for reading and one
for writing:

FILE *fpin, *fpout;

fpin = fdopen(sockfd, "r");
fpout = fdopen(sockfd, "w");


But this approach has problems as well, because it requires the application to call
fclose on both streams in order to free the memory resources associated with
each stream and avoid a memory leak:


fclose(fpin); 
fclose(fpout); 


Each of these operations attempts to close the same underlying socket descriptor, 
so the second close operation will fail. This is not a problem for sequential 
programs, but closing an already closed descriptor in a threaded program is a 
recipe for disaster (see Section 12.7.4). 

Thus, we recommend that you not use the standard I/O functions for input
and output on network sockets. Use the robust Rio functions instead. If you need
formatted output, use the sprintf function to format a string in memory, and then
send it to the socket using rio_writen. If you need formatted input, use
rio_readlineb to read an entire text line, and then use sscanf to extract different
fields from the text line.
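
The format-in-memory pattern can be sketched as follows. Everything named here is hypothetical: write_all is a simplified stand-in for rio_writen (it does not retry after signals), the "code word" line format is invented for the example, and snprintf is used in place of sprintf so that the buffer cannot overflow.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for rio_writen (hypothetical): write all n bytes or return -1. */
static ssize_t write_all(int fd, const char *buf, size_t n)
{
    size_t left = n;
    while (left > 0) {
        ssize_t put = write(fd, buf, left);
        if (put <= 0)
            return -1;
        buf += put;                 /* short count: send the remainder */
        left -= (size_t)put;
    }
    return (ssize_t)n;
}

/* Format "<code> <word>\n" in memory, then send the bytes in one robust write.
 * Returns 0 on success, -1 on error. */
int send_status(int fd, int code, const char *word)
{
    char line[128];
    assert(word != NULL);

    int len = snprintf(line, sizeof(line), "%d %s\n", code, word);
    if (len < 0 || (size_t)len >= sizeof(line))
        return -1;
    return write_all(fd, line, (size_t)len) < 0 ? -1 : 0;
}

/* Extract the two fields from a text line that has already been read.
 * Returns 0 on success, -1 if the line does not match. */
int parse_status(const char *line, int *code, char word[64])
{
    assert(line != NULL && code != NULL && word != NULL);
    return sscanf(line, "%d %63s", code, word) == 2 ? 0 : -1;
}
```

Neither function touches a FILE stream, so the restrictions on mixing stream input and output never arise.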


10.12 Summary


Linux provides a small number of system-level functions, based on the Unix I/O
model, that allow applications to open, close, read, and write files, to fetch file
metadata, and to perform I/O redirection. Linux read and write operations are
subject to short counts that applications must anticipate and handle correctly.
Instead of calling the Unix I/O functions directly, applications should use the Rio
package, which deals with short counts automatically by repeatedly performing
read and write operations until all of the requested data have been transferred.

The Linux kernel uses three related data structures to represent open files.
Entries in a descriptor table point to entries in the open file table, which point
to entries in the v-node table. Each process has its own distinct descriptor table,
while all processes share the same open file and v-node tables. Understanding the
general organization of these structures clarifies our understanding of both file
sharing and I/O redirection.

The standard I/O library is implemented on top of Unix I/O and provides a
powerful set of higher-level I/O routines. For most applications, standard I/O is the
simpler, preferred alternative to Unix I/O. However, because of some mutually
incompatible restrictions on standard I/O and network files, Unix I/O, rather than
standard I/O, should be used for network applications.
Homework Problems


10.6 ◆
What is the output of the following program?

#include "csapp.h"

int main()
{
    int fd1, fd2;

    fd1 = Open("foo.txt", O_RDONLY, 0);
    fd2 = Open("bar.txt", O_RDONLY, 0);
    Close(fd2);
    fd2 = Open("baz.txt", O_RDONLY, 0);
    printf("fd2 = %d\n", fd2);
    exit(0);
}

10.7 ◆
Modify the cpfile program in Figure 10.5 so that it uses the Rio functions to copy
standard input to standard output, MAXBUF bytes at a time.

10.8 ◆◆
Write a version of the statcheck program in Figure 10.10, called fstatcheck,
that takes a descriptor number on the command line rather than a filename.


10.9 ◆◆
Consider the following invocation of the fstatcheck program from Problem 10.8:

linux> fstatcheck 3 < foo.txt

You might expect that this invocation of fstatcheck would fetch and display
metadata for file foo.txt. However, when we run it on our system, it fails with
a “bad file descriptor.” Given this behavior, fill in the pseudocode that the shell
must be executing between the fork and execve calls:

if (Fork() == 0) { /* child */
    /* What code is the shell executing right here? */
    Execve("fstatcheck", argv, envp);
}


10.10 ◆◆
Modify the cpfile program in Figure 10.5 so that it takes an optional
command-line argument infile. If infile is given, then copy infile to standard output;
otherwise, copy standard input to standard output as before. The twist is that your
solution must use the original copy loop (lines 9-11) for both cases. You are only
allowed to insert code, and you are not allowed to change any of the existing code.


Solutions to Practice Problems 


Solution to Problem 10.1 (page 895) 

Unix processes begin life with open descriptors assigned to stdin (descriptor 0), 
stdout (descriptor 1), and stderr (descriptor 2). The open function always re- 
turns the lowest unopened descriptor, so the first call to open returns descriptor 3. 
The call to the close function frees up descriptor 3. The final call to open returns 
descriptor 3, and thus the output of the program is fd2 = 3. 


Solution to Problem 10.2 (page 908) 

The descriptors fd1 and fd2 each have their own open file table entry, so each
descriptor has its own file position for foobar.txt. Thus, the read from fd2 reads
the first byte of foobar.txt, and the output is

c = f

and not

c = o

as you might have thought initially.


Solution to Problem 10.3 (page 908) 

Recall that the child inherits the parent’s descriptor table and that all processes
share the same open file table. Thus, the descriptor fd in both the parent and
child points to the same open file table entry. When the child reads the first byte
of the file, the file position increases by 1. Thus, the parent reads the second byte,
and the output is


c = o


Solution to Problem 10.4 (page 909) 
To redirect standard input (descriptor 0) to descriptor 5, we would call dup2(5,0), 
or equivalently, dup2(5,STDIN_FILENO). 




Solution to Problem 10.5 (page 910)
At first glance, you might think the output would be

c = f

but because we are redirecting fd1 to fd2, the output is really

c = o
Network Programming


Network applications are everywhere. Any time you browse the Web, send an
email message, or play an online game, you are using a network application. 
Interestingly, all network applications are based on the same basic programming 
model, have similar overall logical structures, and rely on the same programming 
interface. 

Network applications rely on many of the concepts that you have already 
learned in our study of systems. For example, processes, signals, byte ordering, 
memory mapping, and dynamic storage allocation all play important roles. There 
are new concepts to master as well. You will need to understand the basic client- 
server programming model and how to write client-server programs that use the 
services provided by the Internet. At the end, we will tie all of these ideas together 
by developing a tiny but functional Web server that can serve both static and
dynamic content with text and graphics to real Web browsers. 


11.1 The Client-Server Programming Model


Every network application is based on the client-server model. With this model, an
application consists of a server process and one or more client processes. A server
manages some resource, and it provides some service for its clients by manipulating
that resource. For example, a Web server manages a set of disk files that it retrieves
and executes on behalf of clients. An FTP server manages a set of disk files that it
stores and retrieves for clients. Similarly, an email server manages a spool file that
it reads and updates for clients.

The fundamental operation in the client-server model is the transaction (Fig- 
ure 11.1). A client-server transaction consists of four steps: 


1. When a client needs service, it initiates a transaction by sending a request to
the server. For example, when a Web browser needs a file, it sends a request
to a Web server.

2. The server receives the request, interprets it, and manipulates its resources in
the appropriate way. For example, when a Web server receives a request from
a browser, it reads a disk file.

3. The server sends a response to the client and then waits for the next request.
For example, a Web server sends the file back to a client.


4. The client receives the response and manipulates it. For example, after a Web
browser receives a page from the server, it displays it on the screen.


Figure 11.1 A client-server transaction.
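
The four steps of a transaction can be sketched with two processes on one host. This is a toy illustration under stated assumptions: the function name one_transaction is hypothetical, the "service" (converting the request to uppercase) is invented for the example, and the processes talk over a socketpair rather than a real network connection. For brevity it also assumes that each small message arrives in a single read.

```c
#include <assert.h>
#include <ctype.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run one client-server transaction. The client (this
 * process) sends `request`; a forked server process replies with the same
 * bytes in uppercase. Returns the response length, or -1 on error. */
ssize_t one_transaction(const char *request, size_t len, char *response)
{
    int sv[2];
    assert(request != NULL && response != NULL && len > 0 && len <= 64);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {                                   /* server process */
        char buf[64];
        close(sv[0]);
        ssize_t n = read(sv[1], buf, sizeof(buf));    /* step 2: receive request */
        for (ssize_t i = 0; i < n; i++)               /* ... and process it */
            buf[i] = (char)toupper((unsigned char)buf[i]);
        if (n > 0 && write(sv[1], buf, (size_t)n) != n)   /* step 3: respond */
            _exit(1);
        _exit(0);
    }

    close(sv[1]);                                     /* client process */
    ssize_t got = -1;
    if (write(sv[0], request, len) == (ssize_t)len)   /* step 1: send request */
        got = read(sv[0], response, len);             /* step 4: receive response */
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return got;
}
```

Both ends are ordinary processes, which is the point made next: the model does not depend on where the client and server run.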


It is important to realize that clients and servers are processes and not
machines, or hosts as they are often called in this context. A single host can run many
different clients and servers concurrently, and a client and server transaction can
be on the same or different hosts. The client-server model is the same, regardless
of the mapping of clients and servers to hosts.
11.2 Networks


Clients and servers often run on separate hosts and communicate using the
hardware and software resources of a computer network. Networks are sophisticated
systems, and we can only hope to scratch the surface here. Our aim is to give you
a workable mental model from a programmer’s perspective.

To a host, a network is just another I/O device that serves as a source and sink
for data, as shown in Figure 11.2.
Figure 11.2 [diagram; recoverable label: CPU chip]

Figure 11.3 Ethernet segment. [diagram; recoverable labels: 100 Mb/s, 100 Mb/s]


An adapter plugged into an expansion slot on the I/O bus provides the physical
interface to the network. Data received from the network are copied from the
adapter across the I/O and memory buses into memory, typically by a DMA
transfer. Similarly, data can also be copied from memory to the network.

Physically, a network is a hierarchical system that is organized by geographical
proximity. At the lowest level is a LAN (local area network) that spans a building
or a campus. The most popular LAN technology by far is Ethernet, which was
developed in the mid-1970s at Xerox PARC. Ethernet has proven to be remarkably
resilient, evolving from 3 Mb/s to 10 Gb/s.

An Ethernet segment consists of some wires (usually twisted pairs of wires)
and a small box called a hub, as shown in Figure 11.3. Ethernet segments typically
span small areas, such as a room or a floor in a building. Each wire has the same
maximum bit bandwidth, typically 100 Mb/s or 1 Gb/s. One end is attached to
an adapter on a host, and the other end is attached to a port on the hub. A hub
slavishly copies every bit that it receives on each port to every other port. Thus,
every host sees every bit.

Each Ethernet adapter has a globally unique 48-bit address that is stored in
a nonvolatile memory on the adapter. A host can send a chunk of bits called a
frame to any other host on the segment. Each frame includes some fixed number
of header bits that identify the source and destination of the frame and the frame
length, followed by a payload of data bits. Every host adapter sees the frame, but
only the destination host actually reads it.
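
The frame description above can be sketched as a C structure. The layout is a simplified assumption made for illustration and is not the actual Ethernet wire format; the names frame_header and adapter_accepts are hypothetical, and special cases such as broadcast addresses are ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ADDR_LEN 6                  /* a 48-bit adapter address */

/* Simplified frame header (hypothetical layout). The payload of data bits
 * would follow the header. */
struct frame_header {
    uint8_t  dst[ADDR_LEN];         /* destination adapter address */
    uint8_t  src[ADDR_LEN];         /* source adapter address */
    uint16_t length;                /* frame length */
};

/* Every adapter on the segment sees the frame, but only the adapter whose
 * address matches the destination field reads it. Returns 1 or 0. */
int adapter_accepts(const uint8_t my_addr[ADDR_LEN], const struct frame_header *h)
{
    assert(my_addr != NULL && h != NULL);
    return memcmp(my_addr, h->dst, ADDR_LEN) == 0;
}
```

On a hub-based segment this check is the only filtering that takes place, since the hub itself copies every bit to every port.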

Multiple Ethernet segments can be connected into larger LANs, called
bridged Ethernets, using a set of wires and small boxes called bridges, as shown
in Figure 11.4. Bridged Ethernets can span entire buildings or campuses. In a
bridged Ethernet, some wires connect bridges to bridges, and others connect
bridges to hubs. The bandwidth of the wires can be different. In our example,
the bridge-bridge wire has a 1 Gb/s bandwidth, while the four hub-bridge wires
have bandwidths of 100 Mb/s.

Bridges make better use of the available wire bandwidth than hubs. Using a
clever distributed algorithm, they automatically learn over time which hosts are
reachable from which ports and then selectively copy frames from one port to
another only when it is necessary. For example, if host A sends a frame to host B,
which is on the same segment, then bridge X will throw away the frame when it arrives
at its input port, thus saving bandwidth on the other segments. However, if host A
sends a frame to host C on a different segment, then bridge X will copy the frame
only to the port connected to bridge Y, which will copy the frame only to the port
connected to host C's segment.
To simplify our pictures of LANs, we will draw the hubs and bridges and the
wires that connect them as a single horizontal line, as shown in Figure 11.5.

At a higher level in the hierarchy, multiple incompatible LANs can be
connected by specialized computers called routers to form an internet (interconnected
network). Each router has an adapter (port) for each network that it is connected
to. Routers can also connect high-speed point-to-point phone connections, which
are examples of networks known as WANs (wide area networks), so called
because they span larger geographical areas than LANs. In general, routers can be
used to build internets from arbitrary collections of LANs and WANs. For
example, Figure 11.6 shows an example internet with a pair of LANs and WANs
connected by three routers.
Figure 11.6 A small internet. Two LANs and two WANs are connected by three routers.


The crucial property of an internet is that it can consist of different LANs 
and WANs with radically different and incompatible technologies. Each host is 
physically connected to every other host, but how is it possible for some source
host to send data bits to another destination host across all of these incompatible 
networks? 

The solution is a layer of protocol software running on each host and router 
that smoothes out the differences between the different networks. This software 
implements a protocol that governs how hosts and routers cooperate in order to 
transfer data. The protocol must provide two basic capabilities: 


Naming scheme. Different LAN technologies have different and incompatible
ways of assigning addresses to hosts. The internet protocol smoothes these
differences by defining a uniform format for host addresses. Each host
is then assigned at least one of these internet addresses that uniquely
identifies it.


Delivery mechanism. Different networking technologies have different and 
incompatible ways of encoding bits on wires and of packaging these bits
into frames. The internet protocol smoothes these differences by defining 
a uniform way to bundle up data bits into discrete chunks called packets. A 
packet consists of a header, which contains the packet size and addresses 
of the source and destination hosts, and a payload, which contains data 
bits sent from the source host. 


Figure 11.7 shows an example of how hosts and routers use the internet
protocol to transfer data across incompatible LANs. The example internet consists
of two LANs connected by a router. A client running on host A, which is attached
to LAN1, sends a sequence of data bytes to a server running on host B, which is
attached to LAN2. There are eight basic steps:


1. The client on host A invokes a system call that copies the data from the client’s 
virtual address space into a kernel buffer. 


2. The protocol software on host A creates a LAN1 frame by appending an
internet header and a LAN1 frame header to the data. The internet header
is addressed to internet host B. The LAN1 frame header is addressed to the
router. It then passes the frame to the adapter. Notice that the payload of the
LAN1 frame is an internet packet, whose payload is the actual user data. This
kind of encapsulation is one of the fundamental insights of internetworking.


3. The LAN1 adapter copies the frame to the network.


4. When the frame reaches the router, the router's LAN1 adapter reads it from
the wire and passes it to the protocol software.


5. The router fetches the destination internet address from the internet packet
header and uses this as an index into a routing table to determine where to
forward the packet, which in this case is LAN2. The router then strips off the
old LAN1 frame header, prepends a new LAN2 frame header addressed to
host B, and passes the resulting frame to the adapter.


6. The router's LAN2 adapter copies the frame to the network.


7. When the frame reaches host B, its adapter reads the frame from the wire and
passes it to the protocol software.


8. Finally, the protocol software on host B strips off the packet header and frame
header. The protocol software will eventually copy the resulting data into the
server's virtual address space when the server invokes a system call that reads
the data.


Figure 11.7 How data travel from one host to another on an internet.
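
Encapsulation, and the header swap a router performs in step 5, can be sketched with byte buffers. The header layouts here are hypothetical simplifications chosen for illustration (real internet and LAN headers carry many more fields), as are the names encapsulate and reframe. The size field records the payload length in bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified headers. */
struct packet_header { uint32_t size; uint32_t src_host; uint32_t dst_host; };
struct frame_header  { uint32_t dst_adapter; };

/* Build a frame in out: frame header, then packet header, then user data.
 * The packet (header plus data) is the payload of the frame. Returns the
 * total number of bytes, or 0 if out is too small. */
size_t encapsulate(uint8_t *out, size_t cap, uint32_t next_hop,
                   uint32_t src, uint32_t dst, const void *data, uint32_t len)
{
    struct frame_header fh = { next_hop };
    struct packet_header ph = { len, src, dst };
    size_t total = sizeof(fh) + sizeof(ph) + len;

    assert(out != NULL && data != NULL);
    if (total > cap)
        return 0;
    memcpy(out, &fh, sizeof(fh));
    memcpy(out + sizeof(fh), &ph, sizeof(ph));
    memcpy(out + sizeof(fh) + sizeof(ph), data, len);
    return total;
}

/* What a router does when forwarding: the packet is left untouched and only
 * the frame header is replaced with one for the next network. */
void reframe(uint8_t *frame, uint32_t new_next_hop)
{
    struct frame_header fh = { new_next_hop };
    assert(frame != NULL);
    memcpy(frame, &fh, sizeof(fh));
}
```

In terms of the steps above, host A would call encapsulate with the router's adapter as the next hop, and the router would call reframe with host B's adapter before sending the frame onto LAN2.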


Of course, we are glossing over many difficult issues here. What if different
networks have different maximum frame sizes? How do routers know where to
forward frames? How are routers informed when the network topology changes?
What if a packet gets lost? Nonetheless, our example captures the essence of the
internet idea, and encapsulation is the key.
Figure 11.8 [diagram; recoverable labels: Internet client host, Internet server host]


11.3 The Global IP Internet


The global IP Internet is the most famous and successful implementation of an
internet. It has existed in one form or another since 1969. While the internal
architecture of the Internet is complex and constantly changing, the organization
of client-server applications has remained remarkably stable since the early 1980s.
Figure 11.8 shows the basic hardware and software organization of an Internet
client-server application.

Each Internet host runs software that implements the TCP/IP protocol
(Transmission Control Protocol/Internet Protocol), which is supported by almost
every modern computer system. Internet clients and servers communicate using
a mix of sockets interface functions and Unix I/O functions. (We will describe the
sockets interface in Section 11.4.) The sockets functions are typically implemented
as system calls that trap into the kernel and call various kernel-mode functions in
TCP/IP.

TCP/IP is actually a family of protocols, each of which contributes different
capabilities. For example, IP provides the basic naming scheme and a delivery
mechanism that can send packets, known as datagrams, from one Internet host to
any other host. The IP mechanism is unreliable in the sense that it makes no effort
to recover if datagrams are lost or duplicated in the network. UDP (User
Datagram Protocol) extends IP slightly, so that datagrams can be transferred
from process to process, rather than host to host. TCP is a complex protocol that
builds on IP to provide reliable full duplex (bidirectional) connections between
processes. To simplify our discussion, we will treat TCP/IP as a single monolithic
protocol. We will not discuss its inner workings, and we will only discuss some of
the basic capabilities that TCP and IP provide to application programs. We will
not discuss UDP.

From a programmer’s perspective, we can think of the Internet as a worldwide 
collection of hosts with the following properties: 





* The set of hosts is mapped to a set of 32-bit IP addresses.

* The set of IP addresses is mapped to a set of identifiers called Internet domain
names.

* A process on one Internet host can communicate with a process on any other
Internet host over a connection.

Aside: The original Internet protocol, with its 32-bit addresses, is known as Internet Protocol Version 4 (IPv4).


The following sections discuss these fundamental Internet ideas in more detail. 


11.3.1 IP Addresses


An IP address is an unsigned 32-bit integer. Network programs store IP addresses 
in the IP address structure shown in Figure 11.9. 
Storing a scalar address in a structure is an unfortunate artifact from the early 
implementations of the sockets interface. It would make more sense to define 
a scalar type for IP addresses, but it is too late to change now because of the 
enormous installed base of applications. 
Because Internet hosts can have different host byte orders, TCP/IP defines a 
uniform network byte order (big-endian byte order) for any integer data item, such
as an IP address, that is carried across the network in a packet header. Addresses in 
IP address structures are always stored in (big-endian) network byte order, even 
if the host byte order is little-endian. Unix provides the following functions for 
converting between network and host byte order. 





code/netp/netpfragments.c
/* IP address structure */
struct in_addr {
    uint32_t s_addr; /* Address in network byte order (big-endian) */
};
code/netp/netpfragments.c

Figure 11.9 IP address structure.

#include <arpa/inet.h>

uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
        Returns: value in network byte order

uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
        Returns: value in host byte order





The htonl function converts an unsigned 32-bit integer from host byte order to
network byte order. The ntohl function converts an unsigned 32-bit integer from
network byte order to host byte order. The htons and ntohs functions perform 
corresponding conversions for unsigned 16-bit integers. Note that there are no 
equivalent functions for manipulating 64-bit values. 

IP addresses are typically presented to humans in a form known as dotted-decimal
notation, where each byte is represented by its decimal value and separated
from the other bytes by a period. For example, 128.2.194.242 is the
dotted-decimal representation of the address 0x8002c2f2. On Linux systems, you
can use the hostname command to determine the dotted-decimal address of your
own host:

linux> hostname -i
128.2.210.175

Application programs can convert back and forth between IP addresses and
dotted-decimal strings using the functions inet_pton and inet_ntop.


#include <arpa/inet.h> 


int inet_pton(AF_INET, const char *src, void *dst);
        Returns: 1 if OK, 0 if src is invalid dotted decimal, -1 on error

const char *inet_ntop(AF_INET, const void *src, char *dst,
                      socklen_t size);
        Returns: pointer to a dotted-decimal string if OK, NULL on error





In these function names, the "n" stands for network and the "p" stands for
presentation. They can manipulate either 32-bit IPv4 addresses (AF_INET), as shown
here, or 128-bit IPv6 addresses (AF_INET6), which we do not cover.

The inet_pton function converts a dotted-decimal string (src) to a binary IP
address in network byte order (dst). If src does not point to a valid dotted-decimal
string, then it returns 0. Any other error returns -1 and sets errno. Similarly, the
inet_ntop function converts a binary IP address in network byte order (src) to
the corresponding dotted-decimal representation and copies at most size bytes
of the resulting null-terminated string to dst.

Complete the following table:

Hex address        Dotted-decimal address
0x0                ________
0xffffffff         ________
0x7f000001         ________
________           205.188.160.121
________           84.12.149.13
________           205.188.146.23

Write a program hex2dd.c that converts its hex argument to a dotted-decimal
string and prints the result. For example,

linux> ./hex2dd 0x8002c2f2
128.2.194.242

Write a program dd2hex.c that converts its dotted-decimal argument to a hex
number and prints the result. For example,

linux> ./dd2hex 128.2.194.242
0x8002c2f2


11.3.2 Internet Domain Names 




Internet clients and servers use IP addresses when they communicate with each
other. However, large integers are difficult for people to remember, so the Internet
also defines a separate set of more human-friendly domain names, as well as a
mechanism that maps the set of domain names to the set of IP addresses. A domain 
name is a sequence of words (letters, numbers, and dashes) separated by periods, 
such as whaleshark.ics.cs.cmu.edu. 

The set of domain names forms a hierarchy, and each domain name encodes 
its position in the hierarchy. An example is the easiest way to understand this.
Figure 11.10 shows a portion of the domain name hierarchy.

The hierarchy is represented as a tree. The nodes of the tree represent domain
names that are formed by the path back to the root. Subtrees are referred to as
subdomains. The first level in the hierarchy is an unnamed root node. The next level
is a collection of first-level domain names that are defined by a nonprofit organi- 
zation called ICANN (Internet Corporation for Assigned Names and Numbers). 
Common first-level domains include com, edu, gov, org, and net. 

[Figure: tree of domain names]
Unnamed root
First-level domain names: mil, edu, gov, com
Second-level domain names: mit, cmu, berkeley, amazon
Third-level domain names: cs, ece, www
Lower levels: ics, pdl, whaleshark, www
Addresses shown at the leaves: 176.32.98.166, 128.2.210.175, 128.2.131.66

Figure 11.10 Subset of the Internet domain name hierarchy.


At the next level are second-level domain names such as cmu.edu, which are
assigned on a first-come first-served basis by various authorized agents of ICANN.
Once an organization has received a second-level domain name, it is free to
create any other new domain name within its subdomain, such as cs.cmu.edu.

The Internet defines a mapping between the set of domain names and the 
set of IP addresses. Until 1988, this mapping was maintained manually in a single
text file called HOSTS.TXT. Since then, the mapping has been maintained in a
distributed worldwide database known as DNS (Domain Name System). Conceptually,
the DNS database consists of millions of host entries, each of which defines
the mapping between a set of domain names and a set of IP addresses. In a
mathematical sense, think of each host entry as an equivalence class of domain names
and IP addresses. We can explore some of the properties of the DNS mappings
with the Linux nslookup program, which displays the IP addresses associated with
a domain name (see footnote 1).

1. We've reformatted the output of nslookup to improve readability.

Each Internet host has the locally defined domain name localhost, which
always maps to the loopback address 127.0.0.1:

linux> nslookup localhost
Address: 127.0.0.1

The localhost name provides a convenient and portable way to reference clients
and servers that are running on the same machine, which can be especially useful
for debugging. We can use hostname to determine the real domain name of our
local host:


linux> hostname
whaleshark.ics.cs.cmu.edu

In the simplest case, there is a one-to-one mapping between a domain name
and an IP address:

linux> nslookup whaleshark.ics.cs.cmu.edu
Address: 128.2.210.175

However, in some cases, multiple domain names are mapped to the same IP
address:

linux> nslookup cs.mit.edu
Address: 18.62.1.6

linux> nslookup eecs.mit.edu
Address: 18.62.1.6

In the most general case, multiple domain names are mapped to the same set of
multiple IP addresses:

linux> nslookup www.twitter.com
Address: 199.16.156.6
Address: 199.16.156.70
Address: 199.16.156.102
Address: 199.16.156.230

linux> nslookup twitter.com
Address: 199.16.156.102
Address: 199.16.156.230
Address: 199.16.156.6
Address: 199.16.156.70

Finally, we notice that some valid domain names are not mapped to any IP
address:

linux> nslookup edu
*** Can't find edu: No answer

linux> nslookup ics.cs.cmu.edu
*** Can't find ics.cs.cmu.edu: No answer


11.3.3 Internet Connections 


Internet clients and servers communicate by sending and receiving streams of
bytes over connections. A connection is point-to-point in the sense that it connects
a pair of processes. It is full duplex in the sense that data can flow in both directions
at the same time. And it is reliable in the sense that, barring some catastrophic
failure such as a cable cut by the proverbial careless backhoe operator, the stream
of bytes sent by the source process is eventually received by the destination process
in the same order it was sent.

Aside
Twice a year since 1987, the Internet Systems Consortium has conducted the Internet Domain Survey. The
survey, which estimates the number of Internet hosts by counting the number of IP addresses that
have been assigned a domain name, reveals an amazing trend. Since 1987, when there were about
20,000 Internet hosts, the number of hosts has been increasing exponentially. By 2015, there were over
1,000,000,000 Internet hosts!

A socket is an end point of a connection. Each socket has a corresponding
socket address that consists of an Internet address and a 16-bit integer port
(see footnote 2) and is denoted by the notation address:port.

The port in the client’s socket address is assigned automatically by the kernel 
when the client makes a connection request and is known as an ephemeral port.
However, the port in the server’s socket address is typically some well-known 
port that is permanently associated with the service. For example, Web servers 
typically use port 80, and email servers use port 25. Associated with each service 
with a well-known port is a corresponding well-known service name. For example, 
the well-known name for the Web service is http, and the well-known name for 
email is smtp. The mapping between well-known names and well-known ports is 
contained in a file called /etc/services.

A connection is uniquely identified by the socket addresses of its two end 
points. This pair of socket addresses is known as a socket pair and is denoted by 
the tuple 


(cliaddr:cliport, servaddr:servport)


where cliaddr is the client's IP address, cliport is the client's port, servaddr is the
server's IP address, and servport is the server's port. For example, Figure 11.11
shows a connection between a Web client and a Web server. 

In this example, the Web client's socket address is 


128.2.194.242:51213 


where port 51213 is an ephemeral port assigned by the kernel. The Web server's
socket address is

208.216.181.15:80

where port 80 is the well-known port associated with Web services. Given these
client and server socket addresses, the connection between the client and server
is uniquely identified by the socket pair

(128.2.194.242:51213, 208.216.181.15:80)

2. These software ports have no relation to the hardware ports in network switches and routers.

[Figure: a client on the host with address 128.2.194.242 has client socket address 128.2.194.242:51213. A server (port 80) on the host with address 208.216.181.15 has server socket address 208.216.181.15:80. The connection socket pair is (128.2.194.242:51213, 208.216.181.15:80).]
Figure 11.11 Anatomy of an Internet connection.

Aside  Origins of the Internet
The Internet is one of the most successful examples of government, university, and industry partnership.
Many factors contributed to its success, but we think two are particularly important: a sustained
30-year investment by the United States government and a commitment by passionate researchers to what
Dave Clarke at MIT has dubbed "rough consensus and working code."

The seeds of the Internet were sown in 1957, when, at the height of the Cold War, the Soviet
Union shocked the world by launching Sputnik, the first artificial earth satellite. In response, the United
States government created the Advanced Research Projects Agency (ARPA), whose charter was to
reestablish the US lead in science and technology. In 1967, Lawrence Roberts at ARPA published
plans for a new network called the ARPANET. The first ARPANET nodes were up and running by
1969. By 1971, there were 13 ARPANET nodes, and email had emerged as the first important network
application.

In 1972, Robert Kahn outlined the general principles of internetworking: a collection of
interconnected networks, with communication between the networks handled independently on a "best-effort
basis" by black boxes called "routers." In 1974, Kahn and Vinton Cerf published the first details of
TCP/IP, which by 1982 had become the standard internetworking protocol for ARPANET. On January
1, 1983, every node on the ARPANET switched to TCP/IP, marking the birth of the global IP Internet.

In 1985, Paul Mockapetris invented DNS, and there were over 1,000 Internet hosts. The next year,
the National Science Foundation (NSF) built the NSFNET backbone connecting 13 sites with 56 Kb/s
phone lines. It was upgraded to 1.5 Mb/s T1 links in 1988 and 45 Mb/s T3 links in 1991. By 1988, there
were more than 50,000 hosts. In 1989, the original ARPANET was officially retired. In 1995, when there
were almost 10,000,000 Internet hosts, NSF retired NSFNET and replaced it with the modern Internet
architecture, based on private commercial backbones connected by public network access points.

Aside
The original sockets interface was developed by researchers at University of California, Berkeley, in
the early 1980s. For this reason, it is often referred to as Berkeley sockets. The Berkeley researchers
developed the sockets interface to work with any underlying protocol. The first implementation was
for TCP/IP, which they included in the Unix 4.2BSD kernel and distributed to numerous universities
and labs. This was an important event in Internet history. Almost overnight, thousands of people had
access to TCP/IP and its source codes. It generated tremendous excitement and sparked a flurry of new
research in networking and internetworking.

11.4 The Sockets Interface


The sockets interface is a set of functions that are used in conjunction with the Unix
I/O functions to build network applications. It has been implemented on most
modern systems, including all Unix variants as well as Windows and Macintosh
systems. Figure 11.12 gives an overview of the sockets interface in the context of a
typical client-server transaction. You should use this picture as a road map when
we discuss the individual functions.


[Figure: two columns, Client and Server. Each side starts with getaddrinfo. The client opens clientfd and sends a connection request. The server opens listenfd and awaits a connection request from the next client.]

Figure 11.12 Overview of network applications based on the sockets interface.

code/netp/netpfragments.c
/* IP socket address structure */
struct sockaddr_in {
    uint16_t        sin_family;  /* Protocol family (always AF_INET) */
    uint16_t        sin_port;    /* Port number in network byte order */
    struct in_addr  sin_addr;    /* IP address in network byte order */
    unsigned char   sin_zero[8]; /* Pad to sizeof(struct sockaddr) */
};

/* Generic socket address structure (for connect, bind, and accept) */
struct sockaddr {
    uint16_t  sa_family;    /* Protocol family */
    char      sa_data[14];  /* Address data */
};
code/netp/netpfragments.c

Figure 11.13 Socket address structures.


11.4.1 Socket Address Structures


From the perspective of the Linux kernel, a socket is an end point for communication.
From the perspective of a Linux program, a socket is an open file with a
corresponding descriptor.
Internet socket addresses are stored in 16-byte structures having the type
sockaddr_in, shown in Figure 11.13. For Internet applications, the sin_family
field is AF_INET, the sin_port field is a 16-bit port number, and the sin_addr
field contains a 32-bit IP address. The IP address and port number are always
stored in network (big-endian) byte order.
The connect, bind, and accept functions require a pointer to a protocol-specific
socket address structure. The problem faced by the designers of the sockets
interface was how to define these functions to accept any kind of socket address
structure. Today, we would use the generic void * pointer, which did not exist in
C at that time. Their solution was to define sockets functions to expect a pointer to
a generic sockaddr structure (Figure 11.13) and then require applications to cast
any pointers to protocol-specific structures to this generic structure. To simplify
our code examples, we follow Stevens's lead and define the following type:





typedef struct sockaddr SA; 
We then use this type whenever we need to cast a sockaddr_in structure to a
generic sockaddr structure.


11.4.2 The socket Function 


Clients and servers use the socket function to create a socket descriptor. 


#include <sys/types.h> 
#include <sys/socket.h> 


int socket(int domain, int type, int protocol); 
        Returns: nonnegative descriptor if OK, -1 on error





If we wanted the socket to be the end point for a connection, then we could call 
socket with the following hardcoded arguments: 


clientfd = Socket(AF_INET, SOCK_STREAM, 0);


where AF_INET indicates that we are using 32-bit IP addresses and SOCK_STREAM
indicates that the socket will be an end point for a connection. However,
the best practice is to use the getaddrinfo function (Section 11.4.7) to generate 
these parameters automatically, so that the code is protocol-independent. We will 
show you how to use getaddrinfo with the socket function in Section 11.4.8. 

The clientfd descriptor returned by socket is only partially opened and 
cannot yet be used for reading and writing. How we finish opening the socket 
depends on whether we are a client or a server. The next section describes how 
we finish opening the socket if we are a client. 


11.4.3 The connect Function 


A client establishes a connection with a server by calling the connect function. 


#include <sys/socket.h> 


int connect(int clientfd, const struct sockaddr *addr,
            socklen_t addrlen);
        Returns: 0 if OK, -1 on error


The connect function attempts to establish an Internet connection with the server
at socket address addr, where addrlen is sizeof(sockaddr_in). The connect
function blocks until either the connection is successfully established or an error
occurs. If successful, the clientfd descriptor is now ready for reading and writing,
and the resulting connection is characterized by the socket pair


(x:y, addr.sin_addr:addr.sin_port)

where x is the client's IP address and y is the ephemeral port that uniquely
identifies the client process on the client host. As with socket, the best practice is
to use getaddrinfo to supply the arguments to connect (see Section 11.4.8).


11.4.4 The bind Function 


The remaining sockets functions (bind, listen, and accept) are used by servers
to establish connections with clients.


#include <sys/socket.h>

int bind(int sockfd, const struct sockaddr *addr,
         socklen_t addrlen);
        Returns: 0 if OK, -1 on error


The bind function asks the kernel to associate the server's socket address in addr
with the socket descriptor sockfd. The addrlen argument is sizeof(sockaddr_in).
As with socket and connect, the best practice is to use getaddrinfo to
supply the arguments to bind (see Section 11.4.8).


11.4.5 The listen Function


Clients are active entities that initiate connection requests. Servers are passive
entities that wait for connection requests from clients. By default, the kernel
assumes that a descriptor created by the socket function corresponds to an active
socket that will live on the client end of a connection. A server calls the listen
function to tell the kernel that the descriptor will be used by a server instead of a
client.


#include <sys/socket.h>

int listen(int sockfd, int backlog);
        Returns: 0 if OK, -1 on error


The listen function converts sockfd from an active socket to a listening socket
that can accept connection requests from clients. The backlog argument is a hint
about the number of outstanding connection requests that the kernel should queue
up before it starts to refuse requests. The exact meaning of the backlog argument
requires an understanding of TCP/IP that is beyond our scope. We will typically
set it to a large value, such as 1,024.

[Figure: three steps, each showing a client with clientfd and a server with listenfd(3).
1. Server blocks in accept, waiting for connection request on listening descriptor listenfd.
2. Client makes connection request by calling and blocking in connect.
3. Server returns connfd from accept. Client returns from connect. Connection is now established between clientfd and connfd (descriptor 4).]

Figure 11.14 The roles of the listening and connected descriptors.


11.4.6 The accept Function


Servers wait for connection requests from clients by calling the accept function.


#include <sys/socket.h>

int accept(int listenfd, struct sockaddr *addr, socklen_t *addrlen);
        Returns: nonnegative connected descriptor if OK, -1 on error


The accept function waits for a connection request from a client to arrive on
the listening descriptor listenfd, then fills in the client's socket address in addr,
and returns a connected descriptor that can be used to communicate with the client
using Unix I/O functions.

The distinction between a listening descriptor and a connected descriptor
confuses many students. The listening descriptor serves as an end point for client
connection requests. It is typically created once and exists for the lifetime of
the server. The connected descriptor is the end point of the connection that is
established between the client and the server. It is created each time the server
accepts a connection request and exists only as long as it takes the server to service
a client.

Figure 11.14 outlines the roles of the listening and connected descriptors. In
step 1, the server calls accept, which waits for a connection request to arrive on
the listening descriptor, which for concreteness we will assume is descriptor 3.
Recall that descriptors 0-2 are reserved for the standard files.

In step 2, the client calls the connect function, which sends a connection
request to listenfd. In step 3, the accept function opens a new connected
descriptor connfd (which we will assume is descriptor 4), establishes the connection
between clientfd and connfd, and then returns connfd to the application. The
client also returns from the connect, and from this point, the client and server
can pass data back and forth by reading and writing clientfd and connfd,
respectively.


11.4.7 Host and Service Conversion 


Linux provides some powerful functions, called getaddrinfo and getnameinfo, 
for converting back and forth between binary socket address structures and the
string representations of hostnames, host addresses, service names, and port
numbers. When used in conjunction with the sockets interface, they allow us to 
write network programs that are independent of any particular version of the IP 
protocol. 


The getaddrinfo Function 


The getaddrinfo function converts string representations of hostnames, host 
addresses, service names, and port numbers into socket address structures. It is
the modern replacement for the obsolete gethostbyname and getservbyname
functions. Unlike these functions, it is reentrant (see Section 12.7.2) and works
with any protocol.


#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int getaddrinfo(const char *host, const char *service,
                const struct addrinfo *hints,
                struct addrinfo **result);
        Returns: 0 if OK, nonzero error code on error

void freeaddrinfo(struct addrinfo *result);
        Returns: nothing

const char *gai_strerror(int errcode);
        Returns: error message

[Figure 11.15: result points to a linked list of addrinfo structs (fields shown: ai_canonname, ai_addr, ai_next). Each ai_addr points to a socket address struct, and the last ai_next is NULL.]

Given host and service (the two components of a socket address), getaddrinfo 
returns a result that points to a linked list of addrinfo structures, each of 
which points to a socket address structure that corresponds to host and service 
(Figure 11.15). 

After a client calls getaddrinfo, it walks this list, trying each socket address 
in turn until the calls to socket and connect succeed and the connection is
established. Similarly, a server tries each socket address on the list until the calls
to socket and bind succeed and the descriptor is bound to a valid socket address.
To avoid memory leaks, the application must eventually free the list by calling
freeaddrinfo. If getaddrinfo returns a nonzero error code, the application can
call gai_strerror to convert the code to a message string.

The host argument to getaddrinfo can be either a domain name or a numeric 
address (e.g., a dotted-decimal IP address). The service argument can be either 
a service name (e.g., http) or a decimal port number. If we are not interested in
converting the hostname to an address, we can set host to NULL. The same holds
for service. However, at least one of them must be specified.

The optional hints argument is an addrinfo structure (Figure 11.16) that 
provides finer control over the list of socket addresses that getaddrinfo
returns. When passed as a hints argument, only the ai_family, ai_socktype,
ai_protocol, and ai_flags fields can be set. The other fields must be set to zero
(or NULL). In practice, we use memset to zero the entire structure and then set a
few selected fields:


* By default, getaddrinfo can return both IPv4 and IPv6 socket addresses.
Setting ai_family to AF_INET restricts the list to IPv4 addresses. Setting it
to AF_INET6 restricts the list to IPv6 addresses.

code/netp/netpfragments.c
struct addrinfo {
    int              ai_flags;     /* Hints argument flags */
    int              ai_family;    /* First arg to socket function */
    int              ai_socktype;  /* Second arg to socket function */
    int              ai_protocol;  /* Third arg to socket function */
    char            *ai_canonname; /* Canonical hostname */
    size_t           ai_addrlen;   /* Size of ai_addr struct */
    struct sockaddr *ai_addr;      /* Ptr to socket address structure */
    struct addrinfo *ai_next;      /* Ptr to next item in linked list */
};
code/netp/netpfragments.c

Figure 11.16 The addrinfo structure used by getaddrinfo.

* By default, for each unique address associated with host, the getaddrinfo
function can return up to three addrinfo structures, each with a different
ai_socktype field: one for connections, one for datagrams (not covered), and
one for raw sockets (not covered). Setting ai_socktype to SOCK_STREAM
restricts the list to at most one addrinfo structure for each unique address,
one whose socket address can be used as the end point of a connection. This
is the desired behavior for all of our example programs.


* The ai_flags field is a bit mask that further modifies the default behavior.
You create it by ORing combinations of various values. Here are some that we
find useful:

AI_ADDRCONFIG. This flag is recommended if you are using connections [34].
It asks getaddrinfo to return IPv4 addresses only if the
local host is configured for IPv4. Similarly for IPv6.

AI_CANONNAME. By default, the ai_canonname field is NULL. If this
flag is set, it instructs getaddrinfo to point the ai_canonname field in
the first addrinfo structure in the list to the canonical (official) name
of host (see Figure 11.15).

AI_NUMERICSERV. By default, the service argument can be a service
name or a port number. This flag forces the service argument to be
a port number.

AI_PASSIVE. By default, getaddrinfo returns socket addresses that can
be used by clients as active sockets in calls to connect. This flag
instructs it to return socket addresses that can be used by servers as
listening sockets. In this case, the host argument should be NULL.
The address field in the resulting socket address structure(s) will be
the wildcard address, which tells the kernel that this server will accept
requests to any of the IP addresses for this host. This is the desired
behavior for all of our example servers.

When getaddrinfo creates an addrinfo structure in the output list, it fills
in each field except for ai_flags. The ai_addr field points to a socket address
structure, the ai_addrlen field gives the size of this socket address structure, and
the ai_next field points to the next addrinfo structure in the list. The other fields
describe various attributes of the socket address.

One of the elegant aspects of getaddrinfo is that the fields in an addrinfo
structure are opaque, in the sense that they can be passed directly to the functions
in the sockets interface without any further manipulation by the application code.
For example, ai_family, ai_socktype, and ai_protocol can be passed directly
to socket. Similarly, ai_addr and ai_addrlen can be passed directly to connect
and bind. This powerful property allows us to write clients and servers that are
independent of any particular version of the IP protocol.
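
The hints setup and the list walk described above can be sketched as follows. The helper name count_passive_addrs is hypothetical, and the choice of flags (AI_PASSIVE with a NULL host, plus AI_NUMERICSERV) is one reasonable configuration for a server, not the only one. Each entry's fields are passed to socket unchanged, which is the opacity property in action.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count the socket addresses a server could use
   for the given decimal port. Returns the count, or -1 on error. */
int count_passive_addrs(const char *port)
{
    struct addrinfo hints, *listp, *p;
    int rc, fd, count = 0;

    memset(&hints, 0, sizeof(hints));   /* Unused fields must be zero */
    hints.ai_family = AF_INET;          /* IPv4 only */
    hints.ai_socktype = SOCK_STREAM;    /* Connections only */
    hints.ai_flags = AI_PASSIVE | AI_NUMERICSERV;

    if ((rc = getaddrinfo(NULL, port, &hints, &listp)) != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return -1;
    }

    /* Walk the list; the fields pass straight through to socket */
    for (p = listp; p != NULL; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0)
            continue;                   /* Try the next entry */
        count++;
        close(fd);
    }

    freeaddrinfo(listp);                /* Avoid a memory leak */
    return count;
}

int main(void)
{
    int n = count_passive_addrs("8080");

    if (n < 0)
        return 1;
    printf("usable socket addresses: %d\n", n);
    return 0;
}
```

A real server would call bind inside the loop with p->ai_addr and p->ai_addrlen and stop at the first entry that succeeds, instead of merely counting.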


The getnameinfo Function 


The getnameinfo function is the inverse of getaddrinfo. It converts a socket ad- 
dress structure to the corresponding host and service name strings. It is the modern 
replacement for the obsolete gethostbyaddr and getservbyport functions, and 
unlike those functions, it is reentrant and protocol-independent. 

#include <sys/socket.h>
#include <netdb.h>

int getnameinfo(const struct sockaddr *sa, socklen_t salen,
                char *host, size_t hostlen,
                char *service, size_t servlen, int flags);
        Returns: 0 if OK, nonzero error code on error





The sa argument points to a socket address structure of size salen bytes, host 
to a buffer of size hostlen bytes, and service to a buffer of size servlen bytes. 
The getnameinfo function converts the socket address structure sa to the 
corresponding host and service name strings and copies them to the host and service 
buffers. If getnameinfo returns a nonzero error code, the application can convert 
it to a string by calling gai_strerror. 

If we don't want the hostname, we can set host to NULL and hostlen to zero. 
The same holds for the service fields. However, one or the other must be set. 

The flags argument is a bit mask that modifies the default behavior. You 
create it by ORing combinations of various values. Here are a couple of useful 
ones: 

NI_NUMERICHOST. By default, getnameinfo tries to return a domain name 
in host. Setting this flag will cause it to return a numeric address string 
instead. 

NI_NUMERICSERV. By default, getnameinfo will look in /etc/services 
and if possible, return a service name instead of a port number. Setting 
this flag forces it to skip the lookup and simply return the port number. 
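The effect of the two flags can be seen with a small sketch that builds an IPv4 socket address by hand and converts it back to strings. The helper name numeric_name is an assumption of ours. With both flags set, getnameinfo should perform no DNS or /etc/services lookups.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/*
 * numeric_name (hypothetical helper): build an IPv4 socket address from a
 * dotted-decimal string and a port, then format it with getnameinfo.
 * Either host or serv (but not both) may be NULL with a length of zero.
 * Returns 0 on success or a nonzero EAI_* code usable with gai_strerror.
 */
int numeric_name(const char *dotted, unsigned short port,
                 char *host, size_t hostlen, char *serv, size_t servlen)
{
    struct sockaddr_in sa;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);                 /* Network byte order */
    if (inet_pton(AF_INET, dotted, &sa.sin_addr) != 1)
        return EAI_FAIL;                       /* Not a valid dotted-decimal string */

    return getnameinfo((struct sockaddr *)&sa, sizeof(sa),
                       host, hostlen, serv, servlen,
                       NI_NUMERICHOST | NI_NUMERICSERV);
}
```

Dropping NI_NUMERICSERV from the flags would let getnameinfo return a service name such as "http" for port 80 on systems whose /etc/services lists it.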
code/netp/hostinfo.c 
1   #include "csapp.h" 
2 
3   int main(int argc, char **argv) 
4   { 
5       struct addrinfo *p, *listp, hints; 
6       char buf[MAXLINE]; 
7       int rc, flags; 
8 
9       if (argc != 2) { 
10          fprintf(stderr, "usage: %s <domain name>\n", argv[0]); 
11          exit(0); 
12      } 
13 
14      /* Get a list of addrinfo records */ 
15      memset(&hints, 0, sizeof(struct addrinfo)); 
16      hints.ai_family = AF_INET;       /* IPv4 only */ 
17      hints.ai_socktype = SOCK_STREAM; /* Connections only */ 
18      if ((rc = getaddrinfo(argv[1], NULL, &hints, &listp)) != 0) { 
19          fprintf(stderr, "getaddrinfo error: %s\n", gai_strerror(rc)); 
20          exit(1); 
21      } 
22 
23      /* Walk the list and display each IP address */ 
24      flags = NI_NUMERICHOST; /* Display address string instead of domain name */ 
25      for (p = listp; p; p = p->ai_next) { 
26          Getnameinfo(p->ai_addr, p->ai_addrlen, buf, MAXLINE, NULL, 0, flags); 
27          printf("%s\n", buf); 
28      } 
29 
30      /* Clean up */ 
31      Freeaddrinfo(listp); 
32 
33      exit(0); 
34  } 


code/netp/hostinfo.c 


Figure 11.17 HOSTINFO displays the mapping of a domain name to its associated IP addresses. 


Figure 11.17 shows a simple program, called HOSTINFO, that uses getaddrinfo 
and getnameinfo to display the mapping of a domain name to its associated IP 
addresses. It is similar to the NSLOOKUP program from Section 11.3.2. 

First, we initialize the hints structure so that getaddrinfo returns the 
addresses we want. In this case, we are looking for 32-bit IP addresses (line 16) 
that can be used as end points of connections (line 17). Since we are only asking 
getaddrinfo to convert domain names, we call it with a NULL service argument. 

After the call to getaddrinfo, we walk the list of addrinfo structures, using 
getnameinfo to convert each socket address to a dotted-decimal address string. 
After walking the list, we are careful to free it by calling freeaddrinfo (although 
for this simple program it is not strictly necessary). 

When we run HOSTINFO, we see that twitter.com maps to four IP addresses, 
which is what we saw using NSLOOKUP in Section 11.3.2. 


linux> ./hostinfo twitter.com 
199.16.156.102 
199.16.156.230 
199.16.156.6 
199.16.156.70 





The getaddrinfo and getnameinfo functions subsume the functionality of 
inet_pton and inet_ntop, respectively, and they provide a higher level of abstraction 
that is independent of any particular address format. To convince yourself how 
handy this is, write a version of HOSTINFO (Figure 11.17) that uses inet_ntop 
instead of getnameinfo to convert each socket address to a dotted-decimal address 
string. 





11.4.8 Helper Functions for the Sockets Interface 


The getaddrinfo function and the sockets interface can seem somewhat daunting 
when you first learn about them. We find it convenient to wrap them with 
higher-level helper functions, called open_clientfd and open_listenfd, that clients and 
servers can use when they want to communicate with each other. 


The open_clientfd Function 


A client establishes a connection with a server by calling open_clientfd. 


#include "csapp.h" 

int open_clientfd(char *hostname, char *port); 

                          Returns: descriptor if OK, -1 on error 





The open_clientfd function establishes a connection with a server running on 
host hostname and listening for connection requests on port number port. It 
returns an open socket descriptor that is ready for input and output using the 
Unix I/O functions. Figure 11.18 shows the code for open_clientfd. 

We call getaddrinfo, which returns a list of addrinfo structures, each of 
which points to a socket address structure that is suitable for establishing a 
connection with a server running on hostname and listening on port. We then walk 
the list, trying each list entry in turn, until the calls to socket and connect 
succeed. If the connect fails, we are careful to close the socket descriptor before 
trying the next entry. If the connect succeeds, we free the list memory and return 
the socket descriptor to the client, which can immediately begin using Unix I/O 
to communicate with the server. 

Notice how there is no dependence on any particular version of IP anywhere 
in the code. The arguments to socket and connect are generated for us 
automatically by getaddrinfo, which allows our code to be clean and portable. 


code/src/csapp.c 
int open_clientfd(char *hostname, char *port) { 
    int clientfd; 
    struct addrinfo hints, *listp, *p; 

    /* Get a list of potential server addresses */ 
    memset(&hints, 0, sizeof(struct addrinfo)); 
    hints.ai_socktype = SOCK_STREAM;  /* Open a connection */ 
    hints.ai_flags = AI_NUMERICSERV;  /* ... using a numeric port arg. */ 
    hints.ai_flags |= AI_ADDRCONFIG;  /* Recommended for connections */ 
    Getaddrinfo(hostname, port, &hints, &listp); 

    /* Walk the list for one that we can successfully connect to */ 
    for (p = listp; p; p = p->ai_next) { 
        /* Create a socket descriptor */ 
        if ((clientfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol)) < 0) 
            continue; /* Socket failed, try the next */ 

        /* Connect to the server */ 
        if (connect(clientfd, p->ai_addr, p->ai_addrlen) != -1) 
            break; /* Success */ 
        Close(clientfd); /* Connect failed, try another */ 
    } 

    /* Clean up */ 
    Freeaddrinfo(listp); 
    if (!p) /* All connects failed */ 
        return -1; 
    else    /* The last connect succeeded */ 
        return clientfd; 
} 
code/src/csapp.c 


Figure 11.18 open_clientfd: Helper function that establishes a connection with a server. It is reentrant 
and protocol-independent. 

The open_listenfd Function 


A server creates a listening descriptor that is ready to receive connection requests 
by calling the open_listenfd function. 


#include "csapp.h" 

int open_listenfd(char *port); 

                          Returns: descriptor if OK, -1 on error 





The open_listenfd function returns a listening descriptor that is ready to receive 
connection requests on port port. Figure 11.19 shows the code for open_listenfd. 

The style is similar to open_clientfd. We call getaddrinfo and then walk 
the resulting list until the calls to socket and bind succeed. Note that in line 20 
we use the setsockopt function (not described here) to configure the server so 
that it can be terminated, be restarted, and begin accepting connection requests 
immediately. By default, a restarted server will deny connection requests from 
clients for approximately 30 seconds, which seriously hinders debugging. 

Since we have called getaddrinfo with the AI_PASSIVE flag and a NULL 
host argument, the address field in each socket address structure is set to the 
wildcard address, which tells the kernel that this server will accept requests to any 
of the IP addresses for this host. 

Finally, we call the listen function to convert listenfd to a listening 
descriptor and return it to the caller. If the listen fails, we are careful to avoid a 
descriptor leak by closing the descriptor before returning. 
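Since setsockopt is not described in the text, the following minimal sketch isolates the one call that open_listenfd makes, together with a getsockopt call that reads the option back. The helper names enable_reuseaddr and reuseaddr_enabled are ours, chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/socket.h>

/*
 * enable_reuseaddr (hypothetical helper): set SO_REUSEADDR on sockfd so
 * that a later bind does not fail with "Address already in use" after a
 * server restart. Returns 0 on success, -1 on error (errno is set).
 */
int enable_reuseaddr(int sockfd)
{
    int optval = 1;  /* Nonzero turns the option on */

    return setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR,
                      (const void *)&optval, sizeof(optval));
}

/*
 * reuseaddr_enabled (hypothetical helper): report the current setting.
 * Returns 1 if the option is on, 0 if it is off, -1 on error.
 */
int reuseaddr_enabled(int sockfd)
{
    int optval = 0;
    socklen_t optlen = sizeof(optval);

    if (getsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &optval, &optlen) < 0)
        return -1;
    return optval != 0;
}
```

The option must be set after socket and before bind, which is why line 20 of Figure 11.19 sits between those two calls.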


11.4.9 Example Echo Client and Server 


The best way to learn the sockets interface is to study example code. Figure 11.20 
shows the code for an echo client. After establishing a connection with the server, 
the client enters a loop that repeatedly reads a text line from standard input, sends 
the text line to the server, reads the echo line from the server, and prints the result 
to standard output. The loop terminates when fgets encounters EOF on standard 
input, either because the user typed Ctrl+D at the keyboard or because it has 
exhausted the text lines in a redirected input file. 

After the loop terminates, the client closes the descriptor. This results in an 
EOF notification being sent to the server, which it detects when it receives a return 
code of zero from its rio_readlineb function. After closing its descriptor, the 
client terminates. Since the client’s kernel automatically closes all open descriptors 
when a process terminates, the close in line 24 is not necessary. However, it is good 
programming practice to explicitly close any descriptors that you have opened. 

Figure 11.21 shows the main routine for the echo server. After opening the 
listening descriptor, it enters an infinite loop. Each iteration waits for a connection 
request from a client, prints the domain name and port of the connected client, and 
then calls the echo function that services the client. After the echo routine returns, 
code/src/csapp.c 
1   int open_listenfd(char *port) 
2   { 
3       struct addrinfo hints, *listp, *p; 
4       int listenfd, optval=1; 
5 
6       /* Get a list of potential server addresses */ 
7       memset(&hints, 0, sizeof(struct addrinfo)); 
8       hints.ai_socktype = SOCK_STREAM;             /* Accept connections */ 
9       hints.ai_flags = AI_PASSIVE | AI_ADDRCONFIG; /* ... on any IP address */ 
10      hints.ai_flags |= AI_NUMERICSERV;            /* ... using port number */ 
11      Getaddrinfo(NULL, port, &hints, &listp); 
12 
13      /* Walk the list for one that we can bind to */ 
14      for (p = listp; p; p = p->ai_next) { 
15          /* Create a socket descriptor */ 
16          if ((listenfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol)) < 0) 
17              continue;  /* Socket failed, try the next */ 
18 
19          /* Eliminates "Address already in use" error from bind */ 
20          Setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, 
21                     (const void *)&optval, sizeof(int)); 
22 
23          /* Bind the descriptor to the address */ 
24          if (bind(listenfd, p->ai_addr, p->ai_addrlen) == 0) 
25              break; /* Success */ 
26          Close(listenfd); /* Bind failed, try the next */ 
27      } 
28 
29      /* Clean up */ 
30      Freeaddrinfo(listp); 
31      if (!p) /* No address worked */ 
32          return -1; 
33 
34      /* Make it a listening socket ready to accept connection requests */ 
35      if (listen(listenfd, LISTENQ) < 0) { 
36          Close(listenfd); 
37          return -1; 
38      } 
39      return listenfd; 
40  } 
code/src/csapp.c 


Figure 11.19 open_listenfd: Helper function that opens and returns a listening descriptor. It is 
reentrant and protocol-independent. 
code/netp/echoclient.c 
1   #include "csapp.h" 
2 
3   int main(int argc, char **argv) 
4   { 
5       int clientfd; 
6       char *host, *port, buf[MAXLINE]; 
7       rio_t rio; 
8 
9       if (argc != 3) { 
10          fprintf(stderr, "usage: %s <host> <port>\n", argv[0]); 
11          exit(0); 
12      } 
13      host = argv[1]; 
14      port = argv[2]; 
15 
16      clientfd = Open_clientfd(host, port); 
17      Rio_readinitb(&rio, clientfd); 
18 
19      while (Fgets(buf, MAXLINE, stdin) != NULL) { 
20          Rio_writen(clientfd, buf, strlen(buf)); 
21          Rio_readlineb(&rio, buf, MAXLINE); 
22          Fputs(buf, stdout); 
23      } 
24      Close(clientfd); 
25      exit(0); 
26  } 
code/netp/echoclient.c 


Figure 11.20 Echo client main routine. 





the main routine closes the connected descriptor. Once the client and server have 
closed their respective descriptors, the connection is terminated. 

The clientaddr variable in line 9 is a socket address structure that is passed 
to accept. Before accept returns, it fills in clientaddr with the socket address of 
the client on the other end of the connection. Notice how we declare clientaddr 
as type struct sockaddr_storage rather than struct sockaddr_in. By defini- 
tion, the sockaddr_storage structure is large enough to hold any type of socket 
address, which keeps the code protocol-independent. 

Notice that our simple echo server can only handle one client at a time. 
A server of this type that iterates through clients, one at a time, is called an iterative 
server. In Chapter 12, we will learn how to build more sophisticated concurrent 
servers that can handle multiple clients simultaneously. 

Finally, Figure 11.22 shows the code for the echo routine, which repeatedly 
reads and writes lines of text until the rio_readlineb function encounters EOF 
in line 10. 
code/netp/echoserveri.c 
1   #include "csapp.h" 
2 
3   void echo(int connfd); 
4 
5   int main(int argc, char **argv) 
6   { 
7       int listenfd, connfd; 
8       socklen_t clientlen; 
9       struct sockaddr_storage clientaddr; /* Enough space for any address */ 
10      char client_hostname[MAXLINE], client_port[MAXLINE]; 
11 
12      if (argc != 2) { 
13          fprintf(stderr, "usage: %s <port>\n", argv[0]); 
14          exit(0); 
15      } 
16 
17      listenfd = Open_listenfd(argv[1]); 
18      while (1) { 
19          clientlen = sizeof(struct sockaddr_storage); 
20          connfd = Accept(listenfd, (SA *)&clientaddr, &clientlen); 
21          Getnameinfo((SA *) &clientaddr, clientlen, client_hostname, MAXLINE, 
22                      client_port, MAXLINE, 0); 
23          printf("Connected to (%s, %s)\n", client_hostname, client_port); 
24          echo(connfd); 
25          Close(connfd); 
26      } 
27      exit(0); 
28  } 
code/netp/echoserveri.c 


Figure 11.21 Iterative echo server main routine. 


code/netp/echo.c 
1   #include "csapp.h" 
2 
3   void echo(int connfd) 
4   { 
5       size_t n; 
6       char buf[MAXLINE]; 
7       rio_t rio; 
8 
9       Rio_readinitb(&rio, connfd); 
10      while((n = Rio_readlineb(&rio, buf, MAXLINE)) != 0) { 
11          printf("server received %d bytes\n", (int)n); 
12          Rio_writen(connfd, buf, n); 
13      } 
14  } 
code/netp/echo.c 


Figure 11.22 echo function that reads and echoes text lines. 

Aside  What does EOF on a connection mean? 

The idea of EOF is often confusing to students, especially in the context of Internet connections. First, 
we need to understand that there is no such thing as an EOF character. Rather, EOF is a condition that 
is detected by the kernel. An application finds out about the EOF condition when it receives a zero 
return code from the read function. For disk files, EOF occurs when the current file position exceeds 
the file length. For Internet connections, EOF occurs when a process closes its end of the connection. 
The process at the other end of the connection detects the EOF when it attempts to read past the last 
byte in the stream. 
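
The EOF condition can be observed without a network. The sketch below is our own illustration: it uses a connected pair of Unix-domain stream sockets as a stand-in for an Internet connection, and the function name bytes_before_eof is hypothetical. One end writes a message and closes; the other end reads until read returns zero.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * bytes_before_eof (hypothetical helper): send len bytes of msg over one
 * end of a socket pair, close that end, and count the bytes the other end
 * receives before read reports EOF by returning 0.
 * Assumes msg is small enough to fit in the socket buffer.
 * Returns the byte count, or -1 on error.
 */
long bytes_before_eof(const char *msg, size_t len)
{
    int fds[2];
    char buf[16];
    long total = 0;
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;
    if (write(fds[0], msg, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[0]);  /* Closing this end is what produces EOF at the other end */

    while ((n = read(fds[1], buf, sizeof(buf))) > 0)
        total += n; /* Data sent before the close is still delivered */
    close(fds[1]);

    return (n == 0) ? total : -1; /* n == 0 is the EOF condition */
}
```

Note that no special character travels over the connection. The reader sees every byte that was written, and only then does read return zero.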

11.5 Web Servers 


So far we have discussed network programming in the context of a simple echo 
server. In this section, we will show you how to use the basic ideas of network 
programming to build your own small, but quite functional, Web server. 


11.5.1 Web Basics 


Web clients and servers interact using a text-based application-level protocol 
known as HTTP (hypertext transfer protocol). HTTP is a simple protocol. A Web 
client (known as a browser) opens an Internet connection to a server and requests 
some content. The server responds with the requested content and then closes the 
connection. The browser reads the content and displays it on the screen. 

What distinguishes Web services from conventional file retrieval services such 
as FTP? The main difference is that Web content can be written in a language 
known as HTML (hypertext markup language). An HTML program (page) con- 
tains instructions (tags) that tell the browser how to display various text and 
graphical objects in the page. For example, the code 


<b> Make me bold! </b> 


tells the browser to print the text between the <b> and </b> tags in boldface type. 
However, the real power of HTML is that a page can contain pointers (hyperlinks) 
to content stored on any Internet host. For example, an HTML line of the form 


<a href="http://www.cmu.edu/index.html">Carnegie Mellon</a> 


tells the browser to highlight the text object Carnegie Mellon and to create a 
hyperlink to an HTML file called index.html that is stored on the CMU Web 
server. If the user clicks on the highlighted text object, the browser requests the 
corresponding HTML file from the CMU server and displays it. 

Aside  Origins of the World Wide Web 

The World Wide Web was invented by Tim Berners-Lee, a software engineer working at CERN, a Swiss 
physics lab. In 1989, Berners-Lee wrote an internal memo proposing a distributed hypertext system 
that would connect a "web of notes with links." The intent of the proposed system was to help CERN 
scientists share and manage information. Over the next two years, after Berners-Lee implemented 
the first Web server and Web browser, the Web developed a small following within CERN and a few 
other sites. A pivotal event occurred in 1993, when Marc Andreessen (who later founded Netscape and 
Andreessen-Horowitz) and his colleagues at NCSA released a graphical browser called MOSAIC for all 
three major platforms: Linux, Windows, and Macintosh. After the release of MOSAIC, interest in the 
Web exploded, with the number of Web sites increasing at an exponential rate. By 2015, there were 
over 975,000,000 sites worldwide. (Source: Netcraft Web Survey) 


MIME type                 Description 

text/html                 HTML page 
text/plain                Unformatted text 
application/postscript    Postscript document 
image/gif                 Binary image encoded in GIF format 
image/png                 Binary image encoded in PNG format 
image/jpeg                Binary image encoded in JPEG format 


Figure 11.23 Example MIME types. 


11.5.2 Web Content 


To Web clients and servers, content is a sequence of bytes with an associated MIME 
(multipurpose internet mail extensions) type. Figure 11.23 shows some common 
MIME types. 

Web servers provide content to clients in two different ways: 

* Fetch a disk file and return its contents to the client. The disk file is known 
as static content and the process of returning the file to the client is known as 
serving static content. 

* Run an executable file and return its output to the client. The output produced 
by the executable at run time is known as dynamic content, and the process of 
running the program and returning its output to the client is known as serving 
dynamic content. 


Every piece of content returned by a Web server is associated with some file 
that it manages. Each of these files has a unique name known as a URL (universal 
resource locator). For example, the URL 


http://www.google.com:80/index.html 


identifies an HTML file called /index.html on Internet host www.google.com 
that is managed by a Web server listening on port 80. The port number is 
optional and defaults to the well-known HTTP port 80. URLs for executable files 
can include program arguments after the filename. A '?' character separates the 
filename from the arguments, and each argument is separated by an '&' character. 
For example, the URL 


http://bluefish.ics.cs.cmu.edu:8000/cgi-bin/adder?15000&213 


identifies an executable called /cgi-bin/adder that will be called with two 
argument strings: 15000 and 213. Clients and servers use different parts of the URL 
during a transaction. For instance, a client uses the prefix 


http://www.google.com:80 


to determine what kind of server to contact, where the server is, and what port it 
is listening on. The server uses the suffix 


/index.html 


to find the file on its filesystem and to determine whether the request is for static 
or dynamic content. 

There are several points to understand about how servers interpret the suffix 
of a URL: 


* There are no standard rules for determining whether a URL refers to static 
or dynamic content. Each server has its own rules for the files it manages. 
A classic (old-fashioned) approach is to identify a set of directories, such as 
cgi-bin, where all executables must reside. 


* The initial ‘/’ in the suffix does not denote the Linux root directory. Rather, it 
denotes the home directory for whatever kind of content is being requested. 
For example, a server might be configured so that all static content is stored 
in directory /usr/httpd/html and all dynamic content is stored in directory 
/usr/httpd/cgi-bin. 

* The minimal URL suffix is the ‘/’ character, which all servers expand to some 
default home page such as /index.html. This explains why it is possible to 
fetch the home page of a site by simply typing a domain name to the browser. 
The browser appends the missing '/' to the URL and passes it to the server, 
which expands the ‘/’ to some default filename. 
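
The way a server might take a URL suffix apart can be sketched in a few lines. This is only one possible policy, under our own assumptions: the function name split_uri is hypothetical, a suffix is treated as dynamic content when it contains the string cgi-bin, and a trailing '/' is expanded to index.html. Real servers are free to use different rules.

```c
#include <stdio.h>
#include <string.h>

/*
 * split_uri (hypothetical helper): split a URL suffix at the first '?'
 * into a filename part and an argument string, expanding a trailing '/'
 * in the filename to the default page index.html.
 * Both output buffers must have room for at least one byte.
 * Returns 1 if the suffix names dynamic content (assumption: the path
 * contains "cgi-bin"), 0 if it names static content.
 */
int split_uri(const char *uri, char *filename, size_t fnlen,
              char *args, size_t arglen)
{
    const char *q = strchr(uri, '?');
    size_t pathlen = q ? (size_t)(q - uri) : strlen(uri);
    size_t used;

    /* Everything after the '?' is the argument string (possibly empty) */
    snprintf(args, arglen, "%s", q ? q + 1 : "");

    /* Everything before the '?' is the filename */
    snprintf(filename, fnlen, "%.*s", (int)pathlen, uri);

    /* The minimal suffix "/" (or any directory) expands to a default page */
    used = strlen(filename);
    if (used > 0 && filename[used - 1] == '/')
        snprintf(filename + used, fnlen - used, "%s", "index.html");

    return strstr(filename, "cgi-bin") != NULL;
}
```

A server would typically go on to prepend its own content directory, such as /usr/httpd/html, to the filename. The sketch leaves that step out.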


11.5.3 HTTP Transactions 


Since HTTP is based on text lines transmitted over Internet connections, we can 
use the Linux TELNET program to conduct transactions with any Web server on 
the Internet. The TELNET program has been largely supplanted by SSH as a remote 
login tool, but it is very handy for debugging servers that talk to clients with text 
lines over connections. For example, Figure 11.24 uses TELNET to request the home 
page from the AOL Web server. 
1   linux> telnet www.aol.com 80         Client: open connection to server 
2   Trying 205.188.146.23...             Telnet prints 3 lines to the terminal 
3   Connected to aol.com. 
4   Escape character is '^]'. 
5   GET / HTTP/1.1                       Client: request line 
6   Host: www.aol.com                    Client: required HTTP/1.1 header 
7                                        Client: empty line terminates headers 
8   HTTP/1.0 200 OK                      Server: response line 
9   MIME-Version: 1.0                    Server: followed by five response headers 
10  Date: Mon, 8 Jan 2010 4:59:42 GMT 
11  Server: Apache-Coyote/1.1 
12  Content-Type: text/html              Server: expect HTML in the response body 
13  Content-Length: 42092                Server: expect 42,092 bytes in the response body 
14                                       Server: empty line terminates response headers 
15  <html>                               Server: first HTML line in response body 
16  ...                                  Server: 766 lines of HTML not shown 
17  </html>                              Server: last HTML line in response body 
18  Connection closed by foreign host.   Server: closes connection 
19  linux>                               Client: closes connection and terminates 


Figure 11.24 Example of an HTTP transaction that serves static content. 


In line 1, we run TELNET from a Linux shell and ask it to open a connection to 
the AOL Web server. TELNET prints three lines of output to the terminal, opens 
the connection, and then waits for us to enter text (line 5). Each time we enter 
a text line and hit the enter key, TELNET reads the line, appends carriage return 
and line feed characters ('\r\n' in C notation), and sends the line to the server. 
This is consistent with the HTTP standard, which requires every text line to be 
terminated by a carriage return and line feed pair. To initiate the transaction, we 
enter an HTTP request (lines 5-7). The server replies with an HTTP response 
(lines 8-17) and then closes the connection (line 18). 


HTTP Requests 


An HTTP request consists of a request line (line 5), followed by zero or more 
request headers (line 6), followed by an empty text line that terminates the list of 
headers (line 7). A request line has the form 


method URI version 


HTTP supports a number of different methods, including GET, POST, OPTIONS, 
HEAD, PUT, DELETE, and TRACE. We will only discuss the workhorse GET 
method, which accounts for a majority of HTTP requests. The GET method 
instructs the server to generate and return the content identified by the URI 
(uniform resource identifier). The URI is the suffix of the corresponding URL 
that includes the filename and optional arguments.⁵ 

The version field in the request line indicates the HTTP version to which the 
request conforms. The most recent HTTP version is HTTP/1.1 [37]. HTTP/1.0 is an 
earlier, much simpler version from 1996 [6]. HTTP/1.1 defines additional headers 
that provide support for advanced features such as caching and security, as well 
as a mechanism that allows a client and server to perform multiple transactions 
over the same persistent connection. In practice, the two versions are compatible 
because HTTP/1.0 clients and servers simply ignore unknown HTTP/1.1 headers. 

To summarize, the request line in line 5 asks the server to fetch and return 
the HTML file /index.html. It also informs the server that the remainder of the 
request will be in HTTP/1.1 format. 

Request headers provide additional information to the server, such as the 
brand name of the browser or the MIME types that the browser understands. 
Request headers have the form 


header-name: header-data 


For our purposes, the only header to be concerned with is the Host header (line 6), 
which is required in HTTP/1.1 requests, but not in HTTP/1.0 requests. The Host 
header is used by proxy caches, which sometimes serve as intermediaries between 
a browser and the origin server that manages the requested file. Multiple proxies 
can exist between a client and an origin server in a so-called proxy chain. The data 
in the Host header, which identifies the domain name of the origin server, allow a 
proxy in the middle of a proxy chain to determine if it might have a locally cached 
copy of the requested content. 

Continuing with our example in Figure 11.24, the empty text line in line 7 
(generated by hitting enter on our keyboard) terminates the headers and instructs 
the server to send the requested HTML file. 
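
The request we typed into TELNET can also be assembled by a program. Here is a minimal sketch, with a helper name (build_get_request) of our own choosing, that formats a request line, a Host header, and the terminating empty line, each ending in a carriage return and line feed pair.

```c
#include <stdio.h>

/*
 * build_get_request (hypothetical helper): format a minimal HTTP/1.1 GET
 * request for `uri` on `host` into buf. Every text line, including the
 * empty line that terminates the headers, ends with "\r\n".
 * Returns the length of the request, or -1 if buf is too small.
 */
int build_get_request(char *buf, size_t size, const char *host, const char *uri)
{
    int n = snprintf(buf, size,
                     "GET %s HTTP/1.1\r\n"   /* Request line: method URI version */
                     "Host: %s\r\n"          /* Required HTTP/1.1 request header */
                     "\r\n",                 /* Empty line terminates the headers */
                     uri, host);

    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

For host www.aol.com and URI /, the resulting bytes correspond to lines 5-7 of Figure 11.24. A client would then send the buffer over a connected descriptor.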


HTTP Responses 


HTTP responses are similar to HTTP requests. An HTTP response consists of 
a response line (line 8), followed by zero or more response headers (lines 9-13), 
followed by an empty line that terminates the headers (line 14), followed by the 
response body (lines 15-17). A response line has the form 


version status-code status-message 


The version field describes the HTTP version that the response conforms to. 
The status-code is a three-digit positive integer that indicates the disposition of 
the request. The status-message gives the English equivalent of the error code. 
Figure 11.25 lists some common status codes and their corresponding messages. 
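
A client can pick a response line apart with sscanf. The following sketch is an illustration under our own assumptions: the function name parse_status_line and the buffer sizes are hypothetical, and the status message is treated as optional.

```c
#include <stdio.h>

/*
 * parse_status_line (hypothetical helper): split a response line of the
 * form "version status-code status-message" into its three fields.
 * The message may contain spaces and stops at the line terminator.
 * Returns 0 on success, -1 if the line is malformed.
 */
int parse_status_line(const char *line, char version[16],
                      int *code, char message[64])
{
    int n;

    message[0] = '\0';  /* The message stays empty if the line has none */
    n = sscanf(line, "%15s %3d %63[^\r\n]", version, code, message);
    if (n < 2)
        return -1;      /* Need at least a version and a status code */
    if (*code < 100 || *code > 999)
        return -1;      /* Status codes are three-digit positive integers */
    return 0;
}
```

Applied to line 8 of Figure 11.24, this yields the version HTTP/1.0, the code 200, and the message OK.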


5. Actually, this is only true when a browser requests content. If a proxy server requests content, then 
the URI must be the complete URL. 

Aside  Passing arguments in HTTP POST requests 

Arguments for HTTP POST requests are passed in the request body rather than in the URI. 


Status code   Status message               Description 

200           OK                           Request was handled without error. 
301           Moved permanently            Content has moved to the hostname in the Location header. 
400           Bad request                  Request could not be understood by the server. 
403           Forbidden                    Server lacks permission to access the requested file. 
404           Not found                    Server could not find the requested file. 
501           Not implemented              Server does not support the request method. 
505           HTTP version not supported   Server does not support version in the request. 

Figure 11.25 Some HTTP status codes. 


The response headers in lines 9-13 provide additional information about the 
response. For our purposes, the two most important headers are Content-Type 
(line 12), which tells the client the MIME type of the content in the response body, 
and Content-Length (line 13), which indicates its size in bytes. 

The empty text line in line 14 that terminates the response headers is followed 
by the response body, which contains the requested content. 
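
To read the response body, a client first needs the value of Content-Length. The sketch below shows one way to extract it from a single header line. The function names header_value and content_length are ours. Header names are compared without regard to case, since HTTP header names are case-insensitive.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <strings.h>  /* strncasecmp */

/*
 * header_value (hypothetical helper): if line has the form
 * "name: value", return a pointer to the start of the value inside line;
 * otherwise return NULL. The name comparison ignores case.
 */
const char *header_value(const char *line, const char *name)
{
    size_t len = strlen(name);

    if (strncasecmp(line, name, len) != 0 || line[len] != ':')
        return NULL;
    line += len + 1;                       /* Skip the name and the colon */
    while (*line == ' ' || *line == '\t')  /* Skip optional whitespace */
        line++;
    return line;
}

/*
 * content_length (hypothetical helper): return the size in bytes given by
 * a Content-Length header line, or -1 if line is some other header or the
 * value is not a nonnegative number.
 */
long content_length(const char *line)
{
    const char *value = header_value(line, "Content-Length");
    char *end;
    long n;

    if (value == NULL)
        return -1;
    n = strtol(value, &end, 10);
    return (end == value || n < 0) ? -1 : n;
}
```

For line 13 of Figure 11.24 this gives 42092, the number of body bytes the client should expect after the empty line.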


11.5.4 Serving Dynamic Content 


If we stop to think for a moment how a server might provide dynamic content 
to a client, certain questions arise. For example, how does the client pass any 
program arguments to the server? How does the server pass these arguments 
to the child process that it creates? How does the server pass other information 
to the child that it might need to generate the content? Where does the child 
send its output? These questions are addressed by a de facto standard called CGI 
(common gateway interface). 


How Does the Client Pass Program Arguments to the Server? 


Arguments for GET requests are passed in the URI. As we have seen, a '?' 
character separates the filename from the arguments, and each argument is separated 
by an '&' character. Spaces are not allowed in arguments and must be represented 
with the %20 string. Similar encodings exist for other special characters. 
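
The %20 encoding is one case of a general scheme in which a '%' is followed by two hexadecimal digits giving a byte value. A server or CGI program can undo the encoding with a short routine such as the following sketch. The name url_decode is ours, and the sketch handles only the %XX form.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * url_decode (hypothetical helper): replace each %XX escape in s with the
 * byte whose hexadecimal value is XX, decoding in place. The result is
 * never longer than the input.
 * Returns 0 on success, -1 if a '%' is not followed by two hex digits
 * (in which case s may be partially decoded).
 */
int url_decode(char *s)
{
    char *out = s;

    while (*s) {
        if (*s == '%') {
            char hex[3];

            if (!isxdigit((unsigned char)s[1]) || !isxdigit((unsigned char)s[2]))
                return -1;
            hex[0] = s[1];
            hex[1] = s[2];
            hex[2] = '\0';
            *out++ = (char)strtol(hex, NULL, 16);  /* "20" becomes ' ' */
            s += 3;
        } else {
            *out++ = *s++;                         /* Ordinary byte, copy as is */
        }
    }
    *out = '\0';
    return 0;
}
```

Decoding should happen after the argument string has been split at the '&' characters, so that an encoded '&' (%26) inside an argument is not mistaken for a separator.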


How Does the Server Pass Arguments to the Child? 


After a server receives a request such as 


GET /cgi-bin/adder?15000&213 HTTP/1.1 

Environment variable   Description 

QUERY_STRING           Program arguments 
SERVER_PORT            Port that the parent is listening on 
REQUEST_METHOD         GET or POST 
REMOTE_HOST            Domain name of client 
REMOTE_ADDR            Dotted-decimal IP address of client 
CONTENT_TYPE           POST only: MIME type of the request body 
CONTENT_LENGTH         POST only: Size in bytes of the request body 


Figure 11.26 Examples of CGI environment variables. 


it calls fork to create a child process and calls execve to run the /cgi-bin/adder 
program in the context of the child. Programs like the adder program are often 
referred to as CGI programs because they obey the rules of the CGI standard. 
Before the call to execve, the child process sets the CGI environment variable 
QUERY_STRING to 15000&213, which the adder program can reference at run 
time using the Linux getenv function. 


How Does the Server Pass Other Information to the Child? 


CGI defines a number of other environment variables that a CGI program can 
expect to be set when it runs. Figure 11.26 shows a subset. 


Where Does the Child Send Its Output? 


A CGI program sends its dynamic content to the standard output. Before the
child process loads and runs the CGI program, it uses the Linux dup2 function
to redirect standard output to the connected descriptor that is associated with
the client. Thus, anything that the CGI program writes to standard output goes
directly to the client.
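
The redirection step can be tried in isolation. The sketch below is a stand-in built on stated assumptions, not Tiny's code: a pipe plays the role of the connected descriptor, and the child writes a fixed message rather than running a CGI program with execve. The helper name capture_child_stdout is ours.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * capture_child_stdout (hypothetical helper): fork a child whose standard
 * output is redirected with dup2 onto the write end of a pipe, and collect
 * in the parent whatever the child writes. The pipe stands in for the
 * connected descriptor. Returns the number of bytes read, or -1 on error.
 */
ssize_t capture_child_stdout(const char *msg, char *out, size_t outlen)
{
    int fds[2];
    pid_t pid;
    ssize_t n, total = 0;

    if (outlen == 0 || pipe(fds) < 0)
        return -1;

    if ((pid = fork()) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                       /* Child */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            if (dup2(fds[1], STDOUT_FILENO) < 0)
                _exit(1);
            close(fds[1]);
        }
        /* From here on, descriptor 1 refers to the pipe */
        if (write(STDOUT_FILENO, msg, strlen(msg)) < 0)
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                        /* Parent keeps only the read end */
    while ((size_t)total < outlen - 1 &&
           (n = read(fds[0], out + total, outlen - 1 - (size_t)total)) > 0)
        total += n;
    out[total] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);                /* Reap the child */
    return total;
}
```

The child never mentions the pipe after the call to dup2. It simply writes to descriptor 1, which is why an unmodified CGI program can reach the client through standard output.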

Notice that since the parent does not know the type or size of the content that
the child generates, the child is responsible for generating the Content-type and
Content-length response headers, as well as the empty line that terminates the
headers.
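
One way to meet this responsibility is to build the body first and then derive the length header from it. The following sketch assumes a hypothetical helper named build_response and an HTML body. It uses snprintf so that an undersized buffer is reported rather than overrun.

```c
#include <stdio.h>
#include <string.h>

/*
 * build_response (hypothetical helper): format the headers a CGI program
 * is responsible for, followed by the empty line and the body, into out.
 * Returns the total length, or -1 if out is too small.
 */
int build_response(char *out, size_t cap, const char *body)
{
    int n = snprintf(out, cap,
                     "Content-length: %zu\r\n"
                     "Content-type: text/html\r\n"
                     "\r\n"
                     "%s",
                     strlen(body), body);

    if (n < 0 || (size_t)n >= cap)
        return -1;          /* Output was truncated or formatting failed */
    return n;
}
```

Because the length is computed from the finished body, the header and the body cannot disagree.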

Figure 11.27 shows a simple CGI program that sums its two arguments and
returns an HTML file with the result to the client. Figure 11.28 shows an HTTP
transaction that serves dynamic content from the adder program.












Practice Problem 11.5

In Section 10.11, we warned you about the dangers of using the C standard I/O
functions in network applications. Yet the CGI program in Figure 11.27 is able to
use standard I/O without any problems. Why?


code/netp/tiny/cgi-bin/adder.c

#include "csapp.h"

int main(void) {
    char *buf, *p;
    char arg1[MAXLINE], arg2[MAXLINE], content[MAXLINE];
    int n1=0, n2=0;

    /* Extract the two arguments */
    if ((buf = getenv("QUERY_STRING")) != NULL) {
        p = strchr(buf, '&');
        *p = '\0';
        strcpy(arg1, buf);
        strcpy(arg2, p+1);
        n1 = atoi(arg1);
        n2 = atoi(arg2);
    }

    /* Make the response body */
    sprintf(content, "QUERY_STRING=%s", buf);
    sprintf(content, "Welcome to add.com: ");
    sprintf(content, "%sTHE Internet addition portal.\r\n<p>", content);
    sprintf(content, "%sThe answer is: %d + %d = %d\r\n<p>",
            content, n1, n2, n1 + n2);
    sprintf(content, "%sThanks for visiting!\r\n", content);

    /* Generate the HTTP response */
    printf("Connection: close\r\n");
    printf("Content-length: %d\r\n", (int)strlen(content));
    printf("Content-type: text/html\r\n\r\n");
    printf("%s", content);
    fflush(stdout);

    exit(0);
}

code/netp/tiny/cgi-bin/adder.c

Figure 11.27 CGI program that sums two integers.


linux> telnet kittyhawk.cmcl.cs.cmu.edu 8000           Client: open connection
Trying 128.2.194.242...
Connected to kittyhawk.cmcl.cs.cmu.edu.
Escape character is '^]'.
GET /cgi-bin/adder?15000&213 HTTP/1.0                  Client: request line
                                                       Client: empty line terminates headers
HTTP/1.0 200 OK                                        Server: response line
Server: Tiny Web Server                                Server: identify server
Content-length: 115                                    Adder: expect 115 bytes in response body
Content-type: text/html                                Adder: expect HTML in response body
                                                       Adder: empty line terminates headers
Welcome to add.com: THE Internet addition portal.      Adder: first HTML line
<p>The answer is: 15000 + 213 = 15213                  Adder: second HTML line in response body
<p>Thanks for visiting!                                Adder: third HTML line in response body
Connection closed by foreign host.                     Server: closes connection
linux>                                                 Client: closes connection and terminates


Figure 11.28 An HTTP transaction that serves dynamic HTML content.


11.6 Putting It Together: The Tiny Web Server


We conclude our discussion of network programming by developing a small but
functioning Web server called Tiny. Tiny is an interesting program. It combines
many of the ideas that we have learned about, such as process control, Unix I/O,
the sockets interface, and HTTP, in only 250 lines of code. While it lacks the
functionality, robustness, and security of a real server, it is powerful enough to
serve both static and dynamic content to real Web browsers. We encourage you
to study it and implement it yourself. It is quite exciting (even for the authors!) to
point a real browser at your own server and watch it display a complicated Web
page with text and graphics.


The Tiny main Routine


Figure 11.29 shows Tiny's main routine. Tiny is an iterative server that listens
for connection requests on the port that is passed in the command line. After
opening a listening socket by calling the open_listenfd function, Tiny executes
the typical infinite server loop, repeatedly accepting a connection request (line 32),
performing a transaction (line 36), and closing its end of the connection (line 37).


The doit Function 


The doit function in Figure 11.30 handles one HTTP transaction. First, we
read and parse the request line (lines 11-14). Notice that we are using the
rio_readlineb function from Figure 10.8 to read the request line.

Tiny supports only the GET method. If the client requests another method
(such as POST), we send it an error message and return to the main routine



code/netp/tiny/tiny.c
 1  /*
 2   * tiny.c - A simple, iterative HTTP/1.0 Web server that uses the
 3   *     GET method to serve static and dynamic content
 4   */
 5  #include "csapp.h"
 6
 7  void doit(int fd);
 8  void read_requesthdrs(rio_t *rp);
 9  int parse_uri(char *uri, char *filename, char *cgiargs);
10  void serve_static(int fd, char *filename, int filesize);
11  void get_filetype(char *filename, char *filetype);
12  void serve_dynamic(int fd, char *filename, char *cgiargs);
13  void clienterror(int fd, char *cause, char *errnum,
14                   char *shortmsg, char *longmsg);
15
16  int main(int argc, char **argv)
17  {
18      int listenfd, connfd;
19      char hostname[MAXLINE], port[MAXLINE];
20      socklen_t clientlen;
21      struct sockaddr_storage clientaddr;
22
23      /* Check command-line args */
24      if (argc != 2) {
25          fprintf(stderr, "usage: %s <port>\n", argv[0]);
26          exit(1);
27      }
28
29      listenfd = Open_listenfd(argv[1]);
30      while (1) {
31          clientlen = sizeof(clientaddr);
32          connfd = Accept(listenfd, (SA *)&clientaddr, &clientlen);
33          Getnameinfo((SA *) &clientaddr, clientlen, hostname, MAXLINE,
34                      port, MAXLINE, 0);
35          printf("Accepted connection from (%s, %s)\n", hostname, port);
36          doit(connfd);
37          Close(connfd);
38      }
39  }
code/netp/tiny/tiny.c

Figure 11.29 The Tiny Web server.


code/netp/tiny/tiny.c
 1  void doit(int fd)
 2  {
 3      int is_static;
 4      struct stat sbuf;
 5      char buf[MAXLINE], method[MAXLINE], uri[MAXLINE], version[MAXLINE];
 6      char filename[MAXLINE], cgiargs[MAXLINE];
 7      rio_t rio;
 8
 9      /* Read request line and headers */
10      Rio_readinitb(&rio, fd);
11      Rio_readlineb(&rio, buf, MAXLINE);
12      printf("Request headers:\n");
13      printf("%s", buf);
14      sscanf(buf, "%s %s %s", method, uri, version);
15      if (strcasecmp(method, "GET")) {
16          clienterror(fd, method, "501", "Not implemented",
17                      "Tiny does not implement this method");
18          return;
19      }
20      read_requesthdrs(&rio);
21
22      /* Parse URI from GET request */
23      is_static = parse_uri(uri, filename, cgiargs);
24      if (stat(filename, &sbuf) < 0) {
25          clienterror(fd, filename, "404", "Not found",
26                      "Tiny couldn't find this file");
27          return;
28      }
29
30      if (is_static) { /* Serve static content */
31          if (!(S_ISREG(sbuf.st_mode)) || !(S_IRUSR & sbuf.st_mode)) {
32              clienterror(fd, filename, "403", "Forbidden",
33                          "Tiny couldn't read the file");
34              return;
35          }
36          serve_static(fd, filename, sbuf.st_size);
37      }
38      else { /* Serve dynamic content */
39          if (!(S_ISREG(sbuf.st_mode)) || !(S_IXUSR & sbuf.st_mode)) {
40              clienterror(fd, filename, "403", "Forbidden",
41                          "Tiny couldn't run the CGI program");
42              return;
43          }
44          serve_dynamic(fd, filename, cgiargs);
45      }
46  }
code/netp/tiny/tiny.c

Figure 11.30 Tiny doit handles one HTTP transaction.


(lines 15-19), which then closes the connection and awaits the next connection
request. Otherwise, we read and (as we shall see) ignore any request headers
(line 20).

Next, we parse the URI into a filename and a possibly empty CGI argument
string, and we set a flag that indicates whether the request is for static or dynamic
content (line 23). If the file does not exist on disk, we immediately send an error
message to the client and return.

Finally, if the request is for static content, we verify that the file is a regular
file and that we have read permission (line 31). If so, we serve the static content
(line 36) to the client. Similarly, if the request is for dynamic content, we verify
that the file is executable (line 39), and, if so, we go ahead and serve the dynamic
content (line 44).
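
The file checks described above can be exercised on their own. The sketch below is a simplified stand-in with a hypothetical name, check_static_file, that maps the outcome of stat to the status code a server following this policy would send for a static file. Note that S_IRUSR tests the owner's read bit, which is an approximation of whether the server process itself can read the file.

```c
#include <sys/stat.h>

/*
 * check_static_file (hypothetical helper): return the HTTP status code a
 * server following the checks above would choose for a static file.
 */
int check_static_file(const char *path)
{
    struct stat sbuf;

    if (stat(path, &sbuf) < 0)
        return 404;                       /* No such file */
    if (!S_ISREG(sbuf.st_mode) || !(S_IRUSR & sbuf.st_mode))
        return 403;                       /* Not a regular, owner-readable file */
    return 200;
}
```

A directory such as / exists but is not a regular file, so it yields 403 rather than 404.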





The clienterror Function


Tiny lacks many of the error-handling features of a real server. However, it does
check for some obvious errors and reports them to the client. The clienterror
function in Figure 11.31 sends an HTTP response to the client with the appropriate

code/netp/tiny/tiny.c

void clienterror(int fd, char *cause, char *errnum,
                 char *shortmsg, char *longmsg)
{
    char buf[MAXLINE], body[MAXBUF];

    /* Build the HTTP response body */
    sprintf(body, "<html><title>Tiny Error</title>");
    sprintf(body, "%s<body bgcolor=""ffffff"">\r\n", body);
    sprintf(body, "%s%s: %s\r\n", body, errnum, shortmsg);
    sprintf(body, "%s<p>%s: %s\r\n", body, longmsg, cause);
    sprintf(body, "%s<hr><em>The Tiny Web server</em>\r\n", body);

    /* Print the HTTP response */
    sprintf(buf, "HTTP/1.0 %s %s\r\n", errnum, shortmsg);
    Rio_writen(fd, buf, strlen(buf));
    sprintf(buf, "Content-type: text/html\r\n");
    Rio_writen(fd, buf, strlen(buf));
    sprintf(buf, "Content-length: %d\r\n\r\n", (int)strlen(body));
    Rio_writen(fd, buf, strlen(buf));
    Rio_writen(fd, body, strlen(body));
}

code/netp/tiny/tiny.c

Figure 11.31 Tiny clienterror sends an error message to the client.


code/netp/tiny/tiny.c
 1  void read_requesthdrs(rio_t *rp)
 2  {
 3      char buf[MAXLINE];
 4
 5      Rio_readlineb(rp, buf, MAXLINE);
 6      while(strcmp(buf, "\r\n")) {
 7          Rio_readlineb(rp, buf, MAXLINE);
 8          printf("%s", buf);
 9      }
10      return;
11  }
code/netp/tiny/tiny.c


Figure 11.32 Tiny read_requesthdrs reads and ignores request headers.


status code and status message in the response line, along with an HTML file in
the response body that explains the error to the browser’s user.

Recall that an HTML response should indicate the size and type of the content
in the body. Thus, we have opted to build the HTML content as a single string so
that we can easily determine its size. Also, notice that we are using the robust
rio_writen function from Figure 10.4 for all output.


The read_requesthdrs Function 


Tiny does not use any of the information in the request headers. It simply reads and
ignores them by calling the read_requesthdrs function in Figure 11.32. Notice
that the empty text line that terminates the request headers consists of a carriage
return and line feed pair, which we check for in line 6.
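
The same terminator test can be shown on an in-memory buffer. The following sketch is an illustration under our own assumptions, with a hypothetical helper named find_body. It does not use the Rio package, and it expects the buffer to begin just after the request line.

```c
#include <string.h>

/*
 * find_body (hypothetical helper): given a buffer holding the lines that
 * follow the request line, return a pointer to the first byte after the
 * empty line that terminates the headers, or NULL if the terminator has
 * not arrived yet.
 */
const char *find_body(const char *after_request_line)
{
    const char *line = after_request_line;

    while (*line) {
        const char *eol = strstr(line, "\r\n");
        if (eol == NULL)
            return NULL;          /* Incomplete line */
        if (eol == line)
            return eol + 2;       /* Empty line: headers are finished */
        line = eol + 2;           /* Skip this header line */
    }
    return NULL;
}
```

Each header line is skipped without being interpreted, which mirrors the policy of reading and ignoring the request headers.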


The parse_uri Function 


Tiny assumes that the home directory for static content is its current directory and
that the home directory for executables is ./cgi-bin. Any URI that contains the
string cgi-bin is assumed to denote a request for dynamic content. The default
filename is ./home.html.

The parse_uri function in Figure 11.33 implements these policies. It parses
the URI into a filename and an optional CGI argument string. If the request is
for static content (line 5), we clear the CGI argument string (line 6) and then
convert the URI into a relative Linux pathname such as ./index.html (lines 7-8).
If the URI ends with a '/' character (line 9), then we append the default filename
(line 10). On the other hand, if the request is for dynamic content (line 13), we
extract any CGI arguments (lines 14-20) and convert the remaining portion of the
URI to a relative Linux filename (lines 21-22).
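
For comparison, here is a sketch of the same policy written with explicit buffer sizes. It is a hypothetical variant, named parse_uri_bounded here, and not the function in Figure 11.33: it leaves the URI unmodified and returns -1 when a result would not fit, at the cost of two extra parameters.

```c
#include <stdio.h>
#include <string.h>

/*
 * parse_uri_bounded (hypothetical variant): same policy as described
 * above, but with buffer sizes. Returns 1 for static content, 0 for
 * dynamic content, and -1 if a result does not fit.
 */
int parse_uri_bounded(const char *uri, char *filename, size_t fcap,
                      char *cgiargs, size_t acap)
{
    const char *query;
    size_t pathlen = strlen(uri);
    int is_static = (strstr(uri, "cgi-bin") == NULL);
    int n;

    if (acap == 0)
        return -1;
    cgiargs[0] = '\0';

    if (!is_static && (query = strchr(uri, '?')) != NULL) {
        n = snprintf(cgiargs, acap, "%s", query + 1);
        if (n < 0 || (size_t)n >= acap)
            return -1;
        pathlen = (size_t)(query - uri);   /* Path stops at the '?' */
    }

    /* Prefix "." to form a relative pathname, and add the default
       filename when a static URI ends in '/' */
    n = snprintf(filename, fcap, ".%.*s%s", (int)pathlen, uri,
                 (is_static && pathlen > 0 && uri[pathlen - 1] == '/')
                     ? "home.html" : "");
    if (n < 0 || (size_t)n >= fcap)
        return -1;

    return is_static;
}
```

Given /cgi-bin/adder?15000&213, this variant produces the filename ./cgi-bin/adder and the argument string 15000&213.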


code/netp/tiny/tiny.c
 1  int parse_uri(char *uri, char *filename, char *cgiargs)
 2  {
 3      char *ptr;
 4
 5      if (!strstr(uri, "cgi-bin")) { /* Static content */
 6          strcpy(cgiargs, "");
 7          strcpy(filename, ".");
 8          strcat(filename, uri);
 9          if (uri[strlen(uri)-1] == '/')
10              strcat(filename, "home.html");
11          return 1;
12      }
13      else { /* Dynamic content */
14          ptr = index(uri, '?');
15          if (ptr) {
16              strcpy(cgiargs, ptr+1);
17              *ptr = '\0';
18          }
19          else
20              strcpy(cgiargs, "");
21          strcpy(filename, ".");
22          strcat(filename, uri);
23          return 0;
24      }
25  }
code/netp/tiny/tiny.c


Figure 11.33 Tiny parse_uri parses an HTTP URI.


The serve_static Function


Tiny serves five common types of static content: HTML files, unformatted text
files, and images encoded in GIF, PNG, and JPEG formats.

The serve_static function in Figure 11.34 sends an HTTP response whose
body contains the contents of a local file. First, we determine the file type by
inspecting the suffix in the filename (line 7) and then send the response line and
response headers to the client (lines 8-13). Notice that a blank line terminates the
headers.

Next, we send the response body by copying the contents of the requested file
to the connected descriptor fd. The code here is somewhat subtle and needs to be
studied carefully. Line 18 opens filename for reading and gets its descriptor. In
line 19, the Linux mmap function maps the requested file to a virtual memory area.
Recall from our discussion of mmap in Section 9.8 that the call to mmap maps the

code/netp/tiny/tiny.c
 1  void serve_static(int fd, char *filename, int filesize)
 2  {
 3      int srcfd;
 4      char *srcp, filetype[MAXLINE], buf[MAXBUF];
 5
 6      /* Send response headers to client */
 7      get_filetype(filename, filetype);
 8      sprintf(buf, "HTTP/1.0 200 OK\r\n");
 9      sprintf(buf, "%sServer: Tiny Web Server\r\n", buf);
10      sprintf(buf, "%sConnection: close\r\n", buf);
11      sprintf(buf, "%sContent-length: %d\r\n", buf, filesize);
12      sprintf(buf, "%sContent-type: %s\r\n\r\n", buf, filetype);
13      Rio_writen(fd, buf, strlen(buf));
14      printf("Response headers:\n");
15      printf("%s", buf);
16
17      /* Send response body to client */
18      srcfd = Open(filename, O_RDONLY, 0);
19      srcp = Mmap(0, filesize, PROT_READ, MAP_PRIVATE, srcfd, 0);
20      Close(srcfd);
21      Rio_writen(fd, srcp, filesize);
22      Munmap(srcp, filesize);
23  }
24
25  /*
26   * get_filetype - Derive file type from filename
27   */
28  void get_filetype(char *filename, char *filetype)
29  {
30      if (strstr(filename, ".html"))
31          strcpy(filetype, "text/html");
32      else if (strstr(filename, ".gif"))
33          strcpy(filetype, "image/gif");
34      else if (strstr(filename, ".png"))
35          strcpy(filetype, "image/png");
36      else if (strstr(filename, ".jpg"))
37          strcpy(filetype, "image/jpeg");
38      else
39          strcpy(filetype, "text/plain");
40  }
code/netp/tiny/tiny.c


Figure 11.34 Tiny serve_static serves static content to a client.

first filesize bytes of file srcfd to a private read-only area of virtual memory
that starts at address srcp.

Once we have mapped the file to memory, we no longer need its descriptor,
so we close the file (line 20). Failing to do this would introduce a potentially fatal
memory leak. Line 21 performs the actual transfer of the file to the client. The
rio_writen function copies the filesize bytes starting at location srcp (which
of course is mapped to the requested file) to the client's connected descriptor.
Finally, line 22 frees the mapped virtual memory area. This is important to avoid
a potentially fatal memory leak.
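
The map, close, write, unmap sequence can be written without the csapp.h wrappers. The sketch below is a minimal version under our own assumptions: the helper name send_file_mmap is hypothetical, the file size is obtained with fstat rather than passed in, and an interrupted write is treated as an error instead of being retried as rio_writen would.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * send_file_mmap (hypothetical helper): map the file at path into memory
 * and write its bytes to outfd. Returns the number of bytes sent, or -1.
 */
ssize_t send_file_mmap(const char *path, int outfd)
{
    struct stat sbuf;
    char *srcp;
    const char *p;
    size_t left;
    int srcfd = open(path, O_RDONLY);

    if (srcfd < 0)
        return -1;
    if (fstat(srcfd, &sbuf) < 0) {
        close(srcfd);
        return -1;
    }
    if (sbuf.st_size == 0) {              /* mmap rejects zero-length maps */
        close(srcfd);
        return 0;
    }

    srcp = mmap(NULL, (size_t)sbuf.st_size, PROT_READ, MAP_PRIVATE, srcfd, 0);
    close(srcfd);                         /* The mapping outlives the descriptor */
    if (srcp == MAP_FAILED)
        return -1;

    /* write may transfer fewer bytes than requested, so loop */
    p = srcp;
    left = (size_t)sbuf.st_size;
    while (left > 0) {
        ssize_t n = write(outfd, p, left);
        if (n < 0) {
            munmap(srcp, (size_t)sbuf.st_size);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    munmap(srcp, (size_t)sbuf.st_size);   /* Release the mapped area */
    return (ssize_t)sbuf.st_size;
}
```

Every exit path after a successful mmap calls munmap, and every exit path after a successful open calls close, which is the discipline the paragraph above asks for.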


The serve_dynamic Function


Tiny serves any type of dynamic content by forking a child process and then
running a CGI program in the context of the child.

The serve_dynamic function in Figure 11.35 begins by sending a response line
indicating success to the client, along with an informational Server header. The
CGI program is responsible for sending the rest of the response. Notice that this
is not as robust as we might wish, since it doesn't allow for the possibility that the
CGI program might encounter some error.

After sending the first part of the response, we fork a new child process
(line 11). The child initializes the QUERY_STRING environment variable with
the CGI arguments from the request URI (line 13). Notice that a real server would

code/netp/tiny/tiny.c
 1  void serve_dynamic(int fd, char *filename, char *cgiargs)
 2  {
 3      char buf[MAXLINE], *emptylist[] = { NULL };
 4
 5      /* Return first part of HTTP response */
 6      sprintf(buf, "HTTP/1.0 200 OK\r\n");
 7      Rio_writen(fd, buf, strlen(buf));
 8      sprintf(buf, "Server: Tiny Web Server\r\n");
 9      Rio_writen(fd, buf, strlen(buf));
10
11      if (Fork() == 0) { /* Child */
12          /* Real server would set all CGI vars here */
13          setenv("QUERY_STRING", cgiargs, 1);
14          Dup2(fd, STDOUT_FILENO);         /* Redirect stdout to client */
15          Execve(filename, emptylist, environ); /* Run CGI program */
16      }
17      Wait(NULL); /* Parent waits for and reaps child */
18  }
code/netp/tiny/tiny.c


Figure 11.35 Tiny serve_dynamic serves dynamic content to a client.



set the other CGI environment variables here as well. For brevity, we have omitted
this step.

Aside  Dealing with prematurely closed connections
Although the basic functions of a Web server are quite simple, we don't want to give you the false
impression that writing a real Web server is easy. Building a robust Web server that runs for extended
periods without crashing is a difficult task that requires a deeper understanding of Linux systems
programming than we've learned here. For example, if a server writes to a connection that has already
been closed by the client (say, because you clicked the "Stop" button on your browser), then the first
such write returns normally, but the second write causes the delivery of a SIGPIPE signal whose default
behavior is to terminate the process. If the SIGPIPE signal is caught or ignored, then the second write
operation returns -1 with errno set to EPIPE. The strerror and perror functions report the EPIPE
error as a "Broken pipe," a nonintuitive message that has confused generations of students. The bottom
line is that a robust server must catch these SIGPIPE signals and check write function calls for EPIPE
errors.
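
The SIGPIPE and EPIPE behavior described in the aside on prematurely closed connections can be observed with a few lines of code. The sketch below is a demonstration under stated assumptions, with a hypothetical function name: it uses a pipe whose read end is already closed rather than a socket, so the very first write fails, whereas on a network connection it is typically the second write that does.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * write_to_closed_pipe (hypothetical demonstration): ignore SIGPIPE, then
 * write to a pipe whose read end is already closed. Returns the errno
 * value left by the failed write, 0 if the write unexpectedly succeeds,
 * or -1 if the setup itself fails.
 */
int write_to_closed_pipe(void)
{
    int fds[2];
    int err = 0;

    /* Without this, the write below would terminate the process */
    if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
        return -1;
    if (pipe(fds) < 0)
        return -1;

    close(fds[0]);                        /* Simulate the peer going away */
    if (write(fds[1], "x", 1) < 0)
        err = errno;                      /* Expected: EPIPE */
    close(fds[1]);
    return err;
}
```

A server that ignores SIGPIPE in this way must still check the return value of every write, since the error is now reported only through errno.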

Next, the child redirects the child’s standard output to the connected file 
descriptor (line 14) and then loads and runs the CGI program (line 15). Since 
the CGI program runs in the context of the child, it has access to the same open 
files and environment variables that existed before the call to the execve function. 
Thus, everything that the CGI program writes to standard output goes directly to 
the client process, without any intervention from the parent process. Meanwhile, 
the parent blocks in a call to wait, waiting to reap the child when it terminates 
(line 17). 


11.7 Summary 


Every network application is based on the client-server model. With this model, 
an application consists of a server and one or more clients. The server manages 
resources, providing a service for its clients by manipulating the resources in some 
way. The basic operation in the client-server model is a client-server transaction,
which consists of a request from a client, followed by a response from the server. 

Clients and servers communicate over a global network known as the Internet. 
From a programmer's point of view, we can think of the Internet as a worldwide 
collection of hosts with the following properties: (1) Each Internet host has a 
unique 32-bit name called its IP address. (2) The set of IP addresses is mapped 
to a set of Internet domain names. (3) Processes on different Internet hosts can 
communicate with each other over connections. 

Clients and servers establish connections by using the sockets interface. A 
socket is an end point of a connection that is presented to applications in the 
form of a file descriptor. The sockets interface provides functions for opening and 
closing socket descriptors. Clients and servers communicate with each other by 
reading and writing these descriptors.

Web servers and their clients (such as browsers) communicate with each other
using the HTTP protocol. A browser requests either static or dynamic content
from the server. A request for static content is served by fetching a file from the
server's disk and returning it to the client. A request for dynamic content is served
by running a program in the context of a child process on the server and returning
its output to the client. The CGI standard provides a set of rules that govern how
the client passes program arguments to the server, how the server passes these
arguments and other information to the child process, and how the child sends
its output back to the client. A simple but functioning Web server that serves
both static and dynamic content can be implemented in a few hundred lines of
C code.


Bibliographic Notes


The official source of information for the Internet is contained in a set of freely
available numbered documents known as RFCs (requests for comments). A
searchable index of RFCs is available on the Web at

    http://rfc-editor.org

RFCs are typically written for developers of Internet infrastructure, and thus
they are usually too detailed for the casual reader. However, for authoritative
information, there is no better source. The HTTP/1.1 protocol is documented in
RFC 2616. The authoritative list of MIME types is maintained at

    http://www.iana.org/assignments/media-types

Kerrisk is the bible for all aspects of Linux programming and provides a detailed
discussion of modern network programming [62]. There are a number of
good general texts on computer networking [65, 84, 114]. The great technical
writer W. Richard Stevens developed a series of classic texts on such topics as
advanced Unix programming [111], the Internet protocols [109, 120, 107], and Unix
network programming [108, 110]. Serious students of Unix systems programming
will want to study all of them. Tragically, Stevens died on September 1, 1999. His
contributions are greatly missed.


Homework Problems


11.6 ◆◆
A. Modify Tiny so that it echoes every request line and request header.

B. Use your favorite browser to make a request to Tiny for static content.
   Capture the output from Tiny in a file.

C. Inspect the output from Tiny to determine the version of HTTP your
   browser uses.

D. Consult the HTTP/1.1 standard in RFC 2616 to determine the meaning of
   each header in the HTTP request from your browser. You can obtain RFC
   2616 from www.rfc-editor.org/rfc.html.


11.7 ◆◆
Extend Tiny so that it serves MPG video files. Check your work using a real
browser.

11.8 ◆◆
Modify Tiny so that it reaps CGI children inside a SIGCHLD handler instead of
explicitly waiting for them to terminate.

11.9 ◆◆
Modify Tiny so that when it serves static content, it copies the requested file to the
connected descriptor using malloc, rio_readn, and rio_writen, instead of mmap
and rio_writen.

11.10 ◆◆
A. Write an HTML form for the CGI adder function in Figure 11.27. Your form
   should include two text boxes that users fill in with the two numbers to be
   added together. Your form should request content using the GET method.

B. Check your work by using a real browser to request the form from Tiny,
   submit the filled-in form to Tiny, and then display the dynamic content
   generated by adder.

11.11 ◆◆
Extend Tiny to support the HTTP HEAD method. Check your work using TELNET
as a Web client.

11.12 ◆◆
Extend Tiny so that it serves dynamic content requested by the HTTP POST
method. Check your work using your favorite Web browser.

11.13 ◆◆◆
Modify Tiny so that it deals cleanly (without terminating) with the SIGPIPE
signals and EPIPE errors that occur when the write function attempts to write to
a prematurely closed connection.


Solutions to Practice Problems 


Solution to Problem 11.1 (page 927) 


Hex address    Dotted-decimal address

0x0            0.0.0.0
0xffffffff     255.255.255.255
0x7f000001     127.0.0.1
0xcdbca079     205.188.160.121
0x400c950d     64.12.149.13
0xcdbc9217     205.188.146.23

Solution to Problem 11.2 (page 927)

code/netp/hex2dd.c

#include "csapp.h"

int main(int argc, char **argv)
{
    struct in_addr inaddr;  /* Address in network byte order */
    uint32_t addr;          /* Address in host byte order */
    char buf[MAXBUF];       /* Buffer for dotted-decimal string */

    if (argc != 2) {
        fprintf(stderr, "usage: %s <hex number>\n", argv[0]);
        exit(0);
    }
    sscanf(argv[1], "%x", &addr);
    inaddr.s_addr = htonl(addr);

    if (!inet_ntop(AF_INET, &inaddr, buf, MAXBUF))
        unix_error("inet_ntop");
    printf("%s\n", buf);

    exit(0);
}

code/netp/hex2dd.c

Solution to Problem 11.3 (page 927)

code/netp/dd2hex.c

#include "csapp.h"

int main(int argc, char **argv)
{
    struct in_addr inaddr;  /* Address in network byte order */
    int rc;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <dotted-decimal>\n", argv[0]);
        exit(0);
    }

    rc = inet_pton(AF_INET, argv[1], &inaddr);
    if (rc == 0)
        app_error("inet_pton error: invalid dotted-decimal address");
    else if (rc < 0)
        unix_error("inet_pton error");

    printf("0x%x\n", ntohl(inaddr.s_addr));
    exit(0);
}

code/netp/dd2hex.c

Solution to Problem 11.4 (page 942)

Here's a solution. Notice how much more difficult it is to use inet_ntop, which
requires messy casting and deep structure references. The getnameinfo function
is much simpler because it does all of that work for us.

code/netp/hostinfo-ntop.c

#include "csapp.h"

int main(int argc, char **argv)
{
    struct addrinfo *p, *listp, hints;
    struct sockaddr_in *sockp;
    char buf[MAXLINE];
    int rc;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <domain name>\n", argv[0]);
        exit(0);
    }

    /* Get a list of addrinfo records */
    memset(&hints, 0, sizeof(struct addrinfo));
    hints.ai_family = AF_INET;       /* IPv4 only */
    hints.ai_socktype = SOCK_STREAM; /* Connections only */
    if ((rc = getaddrinfo(argv[1], NULL, &hints, &listp)) != 0) {
        fprintf(stderr, "getaddrinfo error: %s\n", gai_strerror(rc));
        exit(1);
    }

    /* Walk the list and display each associated IP address */
    for (p = listp; p; p = p->ai_next) {
        sockp = (struct sockaddr_in *)p->ai_addr;
        Inet_ntop(AF_INET, &(sockp->sin_addr), buf, MAXLINE);
        printf("%s\n", buf);
    }

    /* Clean up */
    Freeaddrinfo(listp);

    exit(0);
}

code/netp/hostinfo-ntop.c


Solution to Problem 11.5 (page 954)

The reason that standard I/O works in CGI programs is that the CGI program
running in the child process does not need to explicitly close any of its input
or output streams. When the child terminates, the kernel closes all descriptors
automatically.


As we learned in Chapter 8, logical control flows are concurrent if they overlap
in time. This general phenomenon, known as concurrency, shows up at many 
different levels of a computer system. Hardware exception handlers, processes, 
and Linux signal handlers are all familiar examples. 

Thus far, we have treated concurrency mainly as a mechanism that the oper- 
ating system kernel uses to run multiple application programs. But concurrency is 
not just limited to the kernel. It can play an important role in application programs 
as well. For example, we have seen how Linux signal handlers allow applications 
to respond to asynchronous events such as the user typing Ctrl+C or the program 
accessing an undefined area of virtual memory. Application-level concurrency is
useful in other ways as well: 


* Accessing slow I/O devices. When an application is waiting for data to arrive 
from a slow I/O device such as a disk, the kernel keeps the CPU busy by 
running other processes. Individual applications can exploit concurrency in a 
similar way by overlapping useful work with I/O requests. 


* Interacting with humans. People who interact with computers demand the ability
to perform multiple tasks at the same time. For example, they might want
to resize a window while they are printing a document. Modern windowing
systems use concurrency to provide this capability. Each time the user requests
some action (say, by clicking the mouse), a separate concurrent logical flow is
created to perform the action.


* Reducing latency by deferring work. Sometimes, applications can use concurrency
to reduce the latency of certain operations by deferring other operations
and performing them concurrently. For example, a dynamic storage allocator
might reduce the latency of individual free operations by deferring coalescing
to a concurrent “coalescing” flow that runs at a lower priority, soaking up
spare CPU cycles as they become available.





* Servicing multiple network clients. The iterative network servers that we studied
in Chapter 11 are unrealistic because they can only service one client at
a time. Thus, a single slow client can deny service to every other client. For a
real server that might be expected to service hundreds or thousands of clients
per second, it is not acceptable to allow one slow client to deny service to the
others. A better approach is to build a concurrent server that creates a separate
logical flow for each client. This allows the server to service multiple clients
concurrently and precludes slow clients from monopolizing the server.


* Computing in parallel on multi-core machines. Many modern systems are
equipped with multi-core processors that contain multiple CPUs. Applications
that are partitioned into concurrent flows often run faster on multi-core
machines than on uniprocessor machines because the flows execute in parallel
rather than being interleaved.


Applications that use application-level concurrency are known as concurrent 
programs. Modern operating systems provide three basic approaches for building 
concurrent programs: 
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e Processes. With this approach, éach logical control flow is a process that is 
scheduled and maintained by the kernel. Since processes have separate virtual 
address spaces, flows that want to communicate with each other must use some 
kind of explicit interprocess communication (IPC) mechanism. 


* I/O multiplexing. This is a form of concurrent programming where applications
explicitly schedule their own logical flows in the context of a single process.
Logical flows are modeled as state machines that the main program explicitly
transitions from state to state as a result of data arriving on file descriptors.
Since the program is a single process, all flows share the same address space.


* Threads. Threads are logical flows that run in the context of a single process 
and are scheduled by the kernel. You can think of threads as a hybrid of the 
other two approaches, scheduled by the kernel like process flows and sharing 
the same virtual address space like I/O multiplexing flows. 


This chapter investigates these three different concurrent programming
techniques. To keep our discussion concrete, we will work with the same motivating
application throughout: a concurrent version of the iterative echo server from
Section 11.4.9.




12.1 Concurrent Programming with Processes


The simplest way to build a concurrent program is with processes, using familiar
functions such as fork, exec, and waitpid. For example, a natural approach for
building a concurrent server is to accept client connection requests in the parent
and then create a new child process to service each new client.

To see how this might work, suppose we have two clients and a server that is
listening for connection requests on a listening descriptor (say, 3). Now suppose
that the server accepts a connection request from client 1 and returns a connected
descriptor (say, 4), as shown in Figure 12.1. After accepting the connection request,
the server forks a child, which gets a complete copy of the server's descriptor table.
The child closes its copy of listening descriptor 3, and the parent closes its copy of
connected descriptor 4, since they are no longer needed. This gives us the situation
shown in Figure 12.2, where the child process is busy servicing the client.

Since the connected descriptors in the parent and child each point to the
same file table entry, it is crucial for the parent to close its copy of the connected
descriptor. Otherwise, the file table entry for connected descriptor 4 will never
be released, and the resulting memory leak will eventually consume the available
memory and crash the system.
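The reference counting behind this rule can be observed with a small experiment. The following is a minimal sketch, assuming a pipe in place of a connected socket (the kernel counts references to both kinds of open file in the same way); the function name count_until_eof is hypothetical. The reader sees EOF only after every copy of the write end has been closed, so the parent must close its own copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the number of bytes the parent reads before EOF, or -1 on error. */
long count_until_eof(void)
{
    int fds[2];
    char buf[64];
    long total = 0;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) < 0)
        return -1;
    if ((pid = fork()) < 0)
        return -1;

    if (pid == 0) {                 /* Child: inherits copies of both ends */
        close(fds[0]);
        if (write(fds[1], "hello\n", 6) != 6)
            _exit(1);
        close(fds[1]);              /* Drops the child's reference */
        _exit(0);
    }

    /* Parent: close our copy of the write end. Without this call, the
       open file would still be referenced after the child exits, and
       the read loop below would block forever instead of seeing EOF. */
    close(fds[1]);

    while ((n = read(fds[0], buf, sizeof(buf))) > 0)
        total += n;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (n < 0) ? -1 : total;
}
```

Calling count_until_eof() from a test program should return 6. Commenting out the parent's close(fds[1]) makes the call hang, which is the pipe analogue of a connection that is never released.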

Now suppose that after the parent creates the child for client 1, it accepts
a new connection request from client 2 and returns a new connected descriptor
(say, 5), as shown in Figure 12.3. The parent then forks another child, which begins
servicing its client using connected descriptor 5, as shown in Figure 12.4. At this
point, the parent is waiting for the next connection request and the two children
are servicing their respective clients concurrently.



12.1.1 A Concurrent Server Based on Processes


Figure 12.5 shows the code for a concurrent echo server based on processes. 
The echo function called in line 29 comes from Figure 11.22. There are several 
important points to make about this server:


* First, servers typically run for long periods of time, so we must include a
SIGCHLD handler that reaps zombie children (lines 4-9). Since SIGCHLD
signals are blocked while the SIGCHLD handler is executing, and since Linux
signals are not queued, the SIGCHLD handler must be prepared to reap
multiple zombie children.

* Second, the parent and the child must close their respective copies of connfd
(lines 33 and 30, respectively). As we have mentioned, this is especially
important for the parent, which must close its copy of the connected descriptor
to avoid a memory leak.


* Finally, because of the reference count in the socket's file table entry, the
connection to the client will not be terminated until both the parent's and
child's copies of connfd are closed.
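The first point is the easiest to get wrong, so it may help to see it in isolation. The following is a minimal sketch written against plain POSIX calls rather than the csapp.h wrappers; the names reap_children, reaped, and spawn_and_reap are hypothetical. Because several children may exit while SIGCHLD is blocked, one delivery of the signal can stand for many zombies, and the handler loops until waitpid reports that none are left.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t reaped; /* Children reaped by the handler */

/* SIGCHLD handler: reap every zombie that exists right now */
static void reap_children(int sig)
{
    int saved_errno = errno;         /* waitpid may overwrite errno */
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/* Hypothetical driver: fork n children that exit immediately and wait
   until the handler has reaped them all. Returns the number reaped. */
int spawn_and_reap(int n)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    int forked = 0;

    reaped = 0;
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
        return -1;

    /* Block SIGCHLD so that the children's signals pile up as one
       pending signal, and so that testing `reaped` is race free */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGCHLD);

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            _exit(0);                /* Child exits at once */
        forked++;
    }

    while (reaped < forked)
        sigsuspend(&wait_mask);      /* Atomically unblock and wait */

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    return reaped;
}
```

If the handler called waitpid only once per signal, spawn_and_reap(5) could stall with zombies left over, because the five exits may be collapsed into a single pending SIGCHLD.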


12.1.2 Pros and Cons of Processes 


Processes have a clean model for sharing state information between parents and
children: file tables are shared and user address spaces are not. Having separate
address spaces for processes is both an advantage and a disadvantage. It is
impossible for one process to accidentally overwrite the virtual memory of another
process, which eliminates a lot of confusing failures, an obvious advantage.

On the other hand, separate address spaces make it more difficult for
processes to share state information. To share information, they must use explicit
IPC (interprocess communications) mechanisms. (See the Aside on page 977.)
Another disadvantage of process-based designs is that they tend to be slower
because the overhead for process control and IPC is high.
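Both sides of this trade-off show up in a few lines of code. The following is a minimal sketch, assuming a pipe as the IPC mechanism; the names counter and fork_and_report are hypothetical. The child's assignment to a global variable is invisible to the parent, so the value has to be sent back explicitly.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0; /* After fork, each process has its own copy */

/* Stores the parent's view of counter in *parent_view and the value
   the child sent over the pipe in *child_view. Returns 0 if OK. */
int fork_and_report(int *parent_view, int *child_view)
{
    int fds[2];
    pid_t pid;
    ssize_t n;

    counter = 0;
    if (pipe(fds) < 0)
        return -1;
    if ((pid = fork()) < 0)
        return -1;

    if (pid == 0) {                  /* Child */
        close(fds[0]);
        counter = 42;                /* Changes only the child's copy */
        if (write(fds[1], &counter, sizeof(counter)) != sizeof(counter))
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                   /* Parent */
    n = read(fds[0], child_view, sizeof(*child_view));
    close(fds[0]);
    waitpid(pid, NULL, 0);
    *parent_view = counter;          /* Untouched by the child's write */
    return (n == (ssize_t)sizeof(*child_view)) ? 0 : -1;
}
```

After a successful call, *parent_view is still 0 while *child_view is 42: the isolation protects the parent from the child's write, and the pipe is the price paid to share the result.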





After the parent closes the connected descriptor in line 33 of the concurrent server
in Figure 12.5, the child is still able to communicate with the client using its copy
of the descriptor. Why?





If we were to delete line 30 of Figure 12.5, which closes the connected descriptor,
the code would still be correct, in the sense that there would be no memory leak.
Why?

code/conc/echoserverp.c
 1  #include "csapp.h"
 2  void echo(int connfd);
 3
 4  void sigchld_handler(int sig)
 5  {
 6      while (waitpid(-1, 0, WNOHANG) > 0)
 7          ;
 8      return;
 9  }
10
11  int main(int argc, char **argv)
12  {
13      int listenfd, connfd;
14      socklen_t clientlen;
15      struct sockaddr_storage clientaddr;
16
17      if (argc != 2) {
18          fprintf(stderr, "usage: %s <port>\n", argv[0]);
19          exit(0);
20      }
21
22      Signal(SIGCHLD, sigchld_handler);
23      listenfd = Open_listenfd(argv[1]);
24      while (1) {
25          clientlen = sizeof(struct sockaddr_storage);
26          connfd = Accept(listenfd, (SA *) &clientaddr, &clientlen);
27          if (Fork() == 0) {
28              Close(listenfd); /* Child closes its listening socket */
29              echo(connfd);    /* Child services client */
30              Close(connfd);   /* Child closes connection with client */
31              exit(0);         /* Child exits */
32          }
33          Close(connfd); /* Parent closes connected socket (important!) */
34      }
35  }
code/conc/echoserverp.c


Figure 12.5 Concurrent echo server based on processes. The parent forks a child to handle each new
connection request.


12.2 Concurrent Programming with I/O Multiplexing


Suppose you are asked to write an echo server that can also respond to interactive
commands that the user types to standard input. In this case, the server must
respond to two independent I/O events: (1) a network client making a connection
request, and (2) a user typing a command line at the keyboard. Which event
do we wait for first? Neither option is ideal. If we are waiting for a connection
request in accept, then we cannot respond to input commands. Similarly, if we are
waiting for an input command in read, then we cannot respond to any connection
requests.

One solution to this dilemma is a technique called I/O multiplexing. The basic
idea is to use the select function to ask the kernel to suspend the process,
returning control to the application only after one or more I/O events have occurred, as
in the following examples:













* Return when any descriptor in the set {0, 4} is ready for reading.
* Return when any descriptor in the set {1, 2, 7} is ready for writing.
* Time out if 152.13 seconds have elapsed waiting for an I/O event to occur. 


Select is a complicated function with many different usage scenarios. We 
will only discuss the first scenario: waiting for a set of descriptors to be ready for 
reading. See [62, 110] for a complete discussion.


#include <sys/select.h>

int select(int n, fd_set *fdset, NULL, NULL, NULL);

        Returns: nonzero count of ready descriptors, -1 on error

FD_ZERO(fd_set *fdset);          /* Clear all bits in fdset */
FD_CLR(int fd, fd_set *fdset);   /* Clear bit fd in fdset */
FD_SET(int fd, fd_set *fdset);   /* Turn on bit fd in fdset */
FD_ISSET(int fd, fd_set *fdset); /* Is bit fd in fdset on? */

        Macros for manipulating descriptor sets


The select function manipulates sets of type fd_set, which are known as
descriptor sets. Logically, we think of a descriptor set as a bit vector (introduced in
Section 2.1) of size n:


    $b_{n-1}, \ldots, b_1, b_0$


Each bit $b_k$ corresponds to descriptor k. Descriptor k is a member of the descriptor
set if and only if $b_k = 1$. You are only allowed to do three things with descriptor
sets: (1) allocate them, (2) assign one variable of this type to another, and (3)
modify and inspect them using the FD_ZERO, FD_SET, FD_CLR, and
FD_ISSET macros.

For our purposes, the select function takes two inputs: a descriptor set
(fdset) called the read set, and the cardinality (n) of the read set (actually the
maximum cardinality of any descriptor set). The select function blocks until at
least one descriptor in the read set is ready for reading. A descriptor k is ready
for reading if and only if a request to read 1 byte from that descriptor would not
block. As a side effect, select modifies the fd_set pointed to by argument fdset
to indicate a subset of the read set called the ready set, consisting of the descriptors
in the read set that are ready for reading. The value returned by the function
indicates the cardinality of the ready set. Note that because of the side effect, we
must update the read set every time select is called.
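The side effect is easier to see with descriptors whose readiness we control. The following is a minimal sketch, assuming two pipes stand in for the descriptors being watched; the function name ready_subset_demo is hypothetical, and the call uses the real five-argument form of select with the last three arguments set to NULL. Only one pipe has data, so the ready set that select leaves behind is a proper subset of the read set.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/select.h>
#include <unistd.h>

/* Watches the read ends of two pipes, only the first of which has data.
   Returns a mask: bit 0 set if pipe a is ready, bit 1 set if pipe b is
   ready, or -1 on error. Descriptors must be below FD_SETSIZE. */
int ready_subset_demo(void)
{
    int a[2], b[2];
    fd_set read_set, ready_set;
    int maxfd, nready, result = 0;

    if (pipe(a) < 0)
        return -1;
    if (pipe(b) < 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }

    FD_ZERO(&read_set);              /* Read set = {a[0], b[0]} */
    FD_SET(a[0], &read_set);
    FD_SET(b[0], &read_set);
    maxfd = (a[0] > b[0]) ? a[0] : b[0];

    if (write(a[1], "x", 1) != 1) {  /* Make only pipe a readable */
        result = -1;
    } else {
        ready_set = read_set;        /* select overwrites its argument */
        nready = select(maxfd + 1, &ready_set, NULL, NULL, NULL);
        if (nready < 0) {
            result = -1;
        } else {
            if (FD_ISSET(a[0], &ready_set))
                result |= 1;
            if (FD_ISSET(b[0], &ready_set))
                result |= 2;
        }
    }

    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return result;
}
```

The function should return 1. Passing &read_set directly to select would have erased b[0] from the read set, which is why the copy into ready_set is made before each call.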

The best way to understand select is to study a concrete example. Figure 12.6
shows how we might use select to implement an iterative echo server that also
accepts user commands on the standard input. We begin by using the
open_listenfd function from Figure 11.19 to open a listening descriptor (line 16), and
then using FD_ZERO to create an empty read set (line 18):


                  listenfd              stdin
                     3      2      1      0
read_set (∅):     [  0   |  0   |  0   |  0  ]


Next, in lines 19 and 20, we define the read set to consist of descriptor 0 (standard
input) and descriptor 3 (the listening descriptor), respectively:


                  listenfd              stdin
                     3      2      1      0
read_set ({0,3}): [  1   |  0   |  0   |  1  ]


At this point, we begin the typical server loop. But instead of waiting for a
connection request by calling the accept function, we call the select function,
which blocks until either the listening descriptor or standard input is ready for
reading (line 24). For example, here is the value of ready_set that select would
return if the user hit the enter key, thus causing the standard input descriptor to

code/conc/select.c
 1  #include "csapp.h"
 2  void echo(int connfd);
 3  void command(void);
 4
 5  int main(int argc, char **argv)
 6  {
 7      int listenfd, connfd;
 8      socklen_t clientlen;
 9      struct sockaddr_storage clientaddr;
10      fd_set read_set, ready_set;
11
12      if (argc != 2) {
13          fprintf(stderr, "usage: %s <port>\n", argv[0]);
14          exit(0);
15      }
16      listenfd = Open_listenfd(argv[1]);
17
18      FD_ZERO(&read_set);              /* Clear read set */
19      FD_SET(STDIN_FILENO, &read_set); /* Add stdin to read set */
20      FD_SET(listenfd, &read_set);     /* Add listenfd to read set */
21
22      while (1) {
23          ready_set = read_set;
24          Select(listenfd+1, &ready_set, NULL, NULL, NULL);
25          if (FD_ISSET(STDIN_FILENO, &ready_set))
26              command(); /* Read command line from stdin */
27          if (FD_ISSET(listenfd, &ready_set)) {
28              clientlen = sizeof(struct sockaddr_storage);
29              connfd = Accept(listenfd, (SA *)&clientaddr, &clientlen);
30              echo(connfd); /* Echo client input until EOF */
31              Close(connfd);
32          }
33      }
34  }
35
36  void command(void) {
37      char buf[MAXLINE];
38      if (!Fgets(buf, MAXLINE, stdin))
39          exit(0); /* EOF */
40      printf("%s", buf); /* Process the input command */
41  }
code/conc/select.c





Figure 12.6 An iterative echo server that uses I/O multiplexing. The server uses
select to wait for connection requests on a listening descriptor and commands on
standard input.
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become ready for reading: 


                  listenfd              stdin
                     3      2      1      0
ready_set ({0}):  [  0   |  0   |  0   |  1  ]


Once select returns, we use the FD_ISSET macro to determine which
descriptors are ready for reading. If standard input is ready (line 25), we call the
command function, which reads, parses, and responds to the command before
returning to the main routine. If the listening descriptor is ready (line 27), we call
accept to get a connected descriptor and then call the echo function from
Figure 11.22, which echoes each line from the client until the client closes its end of
the connection.

While this program is a good example of using select, it still leaves something
to be desired. The problem is that once it connects to a client, it continues echoing
input lines until the client closes its end of the connection. Thus, if you type a
command to standard input, you will not get a response until the server is finished
with the client. A better approach would be to multiplex at a finer granularity,
echoing (at most) one text line each time through the server loop.





In Linux systems, typing Ctrl+D indicates EOF on standard input. What happens
if you type Ctrl+D to the program in Figure 12.6 while it is blocked in the call to
select?


12.2.1 A Concurrent Event-Driven Server Based on I/O Multiplexing


I/O multiplexing can be used as the basis for concurrent event-driven programs, 
where flows make progress as a result of certain events. The general idea is to 
model logical flows as state machines. Informally, a state machine is a collection of 
states, input events, and transitions that map states and input events to states. Each 
transition maps an (input state, input event) pair to an output state. A self-loop is 
a transition between the same input and output state. State machines are typically 
drawn as directed graphs, where nodes represent states, directed arcs represent 
transitions, and arc labels represent input events. A state machine begins execution 
in some initial state. Each input event triggers a transition from the current state 
to the next state.

For each new client k, a concurrent server based on I/O multiplexing creates
a new state machine $s_k$ and associates it with connected descriptor $d_k$. As shown
in Figure 12.7, each state machine $s_k$ has one state ("waiting for descriptor $d_k$ to
be ready for reading"), one input event ("descriptor $d_k$ is ready for reading"), and
one transition ("read a text line from descriptor $d_k$").
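The definition of a state machine as a map from (state, input event) pairs to states translates directly into a lookup table. The following is a minimal sketch of one per-client machine; the DONE state and the PEER_EOF event are assumptions added for illustration (the model in the text has a single state and simply deletes the machine when the client finishes), and the names step and run are hypothetical.

```c
enum state { WAITING, DONE };        /* DONE: hypothetical terminal state */
enum event { LINE_READY, PEER_EOF }; /* PEER_EOF: hypothetical event */

/* next_state[s][e] is the state reached from s on input event e.
   The entry for (WAITING, LINE_READY) is the self-loop from the text. */
static const enum state next_state[2][2] = {
    /* WAITING */ { WAITING, DONE },
    /* DONE    */ { DONE,    DONE },
};

/* Perform one transition */
enum state step(enum state s, enum event e)
{
    return next_state[s][e];
}

/* Start in the initial state and apply n input events in order */
enum state run(const enum event *events, int n)
{
    enum state s = WAITING;
    for (int i = 0; i < n; i++)
        s = step(s, events[i]);
    return s;
}
```

A client that sends two lines and then closes its connection drives the machine through the events LINE_READY, LINE_READY, PEER_EOF, leaving it in DONE. In a server, each transition would also carry an action, such as echoing the line that was read.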

















Figure 12.7 State machine for a logical flow in a concurrent event-driven echo server.
    State: "waiting for descriptor $d_k$ to be ready for reading"
    Input event: "descriptor $d_k$ is ready for reading"
    Transition: "read a text line from descriptor $d_k$"


The server uses I/O multiplexing, courtesy of the select function, to
detect the occurrence of input events. As each connected descriptor becomes
ready for reading, the server executes the transition for the corresponding state
machine (in this case, reading and echoing a text line from the descriptor).

Figure 12.8 shows the complete example code for a concurrent event-driven
server based on I/O multiplexing. The set of active clients is maintained in a pool
structure (lines 3-11). After initializing the pool by calling init_pool (line 27),
the server enters an infinite loop. During each iteration of this loop, the server calls
the select function to detect two different kinds of input events: (1) a connection
request arriving from a new client, and (2) a connected descriptor for an existing
client being ready for reading. When a connection request arrives (line 35), the
server opens the connection (line 37) and calls the add_client function to add the
client to the pool (line 38). Finally, the server calls the check_clients function to
echo a single text line from each ready connected descriptor (line 42).

The init_pool function (Figure 12.9) initializes the client pool. The clientfd
array represents a set of connected descriptors, with the integer -1 denoting an
available slot. Initially, the set of connected descriptors is empty (lines 5-7), and
the listening descriptor is the only descriptor in the select read set (lines 10-12).

The add_client function (Figure 12.10) adds a new client to the pool of active
clients. After finding an empty slot in the clientfd array, the server adds the
connected descriptor to the array and initializes a corresponding Rio read buffer
so that we can call rio_readlineb on the descriptor (lines 8-9). We then add
the connected descriptor to the select read set (line 12), and we update some
global properties of the pool. The maxfd variable (lines 15-16) keeps track of the
largest file descriptor for select. The maxi variable (lines 17-18) keeps track of
the largest index into the clientfd array so that the check_clients function does
not have to search the entire array.


The check_clients function in Figure 12.11 echoes a text line from each
ready connected descriptor. If we are successful in reading a text line from the
descriptor, then we echo that line back to the client (lines 15-18). Notice that in
line 15, we are maintaining a cumulative count of total bytes received from all
clients. If we detect EOF because the client has closed its end of the connection,
then we close our end of the connection (line 23) and remove the descriptor from
the pool (lines 24-25).
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^id 


code/conc/echoservers.c
 1  #include "csapp.h"
 2
 3  typedef struct { /* Represents a pool of connected descriptors */
 4      int maxfd;        /* Largest descriptor in read_set */
 5      fd_set read_set;  /* Set of all active descriptors */
 6      fd_set ready_set; /* Subset of descriptors ready for reading */
 7      int nready;       /* Number of ready descriptors from select */
 8      int maxi;         /* High water index into client array */
 9      int clientfd[FD_SETSIZE];    /* Set of active descriptors */
10      rio_t clientrio[FD_SETSIZE]; /* Set of active read buffers */
11  } pool;
12
13  int byte_cnt = 0; /* Counts total bytes received by server */
14
15  int main(int argc, char **argv)
16  {
17      int listenfd, connfd;
18      socklen_t clientlen;
19      struct sockaddr_storage clientaddr;
20      static pool pool;
21
22      if (argc != 2) {
23          fprintf(stderr, "usage: %s <port>\n", argv[0]);
24          exit(0);
25      }
26      listenfd = Open_listenfd(argv[1]);
27      init_pool(listenfd, &pool);
28
29      while (1) {
30          /* Wait for listening/connected descriptor(s) to become ready */
31          pool.ready_set = pool.read_set;
32          pool.nready = Select(pool.maxfd+1, &pool.ready_set, NULL, NULL, NULL);
33
34          /* If listening descriptor ready, add new client to pool */
35          if (FD_ISSET(listenfd, &pool.ready_set)) {
36              clientlen = sizeof(struct sockaddr_storage);
37              connfd = Accept(listenfd, (SA *)&clientaddr, &clientlen);
38              add_client(connfd, &pool);
39          }
40
41          /* Echo a text line from each ready connected descriptor */
42          check_clients(&pool);
43      }
44  }
code/conc/echoservers.c


Figure 12.8 Concurrent echo server based on I/O multiplexing. Each server iteration echoes a text line
from each ready descriptor.








 1  void init_pool(int listenfd, pool *p)
 2  {
 3      /* Initially, there are no connected descriptors */
 4      int i;
 5      p->maxi = -1;
 6      for (i = 0; i < FD_SETSIZE; i++)
 7          p->clientfd[i] = -1;
 8
 9      /* Initially, listenfd is only member of select read set */
10      p->maxfd = listenfd;
11      FD_ZERO(&p->read_set);
12      FD_SET(listenfd, &p->read_set);
13  }
code/conc/echoservers.c


Figure 12.9 init_pool initializes the pool of active clients.


code/conc/echoservers.c
 1  void add_client(int connfd, pool *p)
 2  {
 3      int i;
 4      p->nready--;
 5      for (i = 0; i < FD_SETSIZE; i++) /* Find an available slot */
 6          if (p->clientfd[i] < 0) {
 7              /* Add connected descriptor to the pool */
 8              p->clientfd[i] = connfd;
 9              Rio_readinitb(&p->clientrio[i], connfd);
10
11              /* Add the descriptor to descriptor set */
12              FD_SET(connfd, &p->read_set);
13
14              /* Update max descriptor and pool high water mark */
15              if (connfd > p->maxfd)
16                  p->maxfd = connfd;
17              if (i > p->maxi)
18                  p->maxi = i;
19              break;
20          }
21      if (i == FD_SETSIZE) /* Couldn't find an empty slot */
22          app_error("add_client error: Too many clients");
23  }
code/conc/echoservers.c


Figure 12.10 add_client adds a new client connection to the pool.


code/conc/echoservers.c
 1  void check_clients(pool *p)
 2  {
 3      int i, connfd, n;
 4      char buf[MAXLINE];
 5      rio_t rio;
 6
 7      for (i = 0; (i <= p->maxi) && (p->nready > 0); i++) {
 8          connfd = p->clientfd[i];
 9          rio = p->clientrio[i];
10
11          /* If the descriptor is ready, echo a text line from it */
12          if ((connfd > 0) && (FD_ISSET(connfd, &p->ready_set))) {
13              p->nready--;
14              if ((n = Rio_readlineb(&rio, buf, MAXLINE)) != 0) {
15                  byte_cnt += n;
16                  printf("Server received %d (%d total) bytes on fd %d\n",
17                         n, byte_cnt, connfd);
18                  Rio_writen(connfd, buf, n);
19              }
20
21              /* EOF detected, remove descriptor from pool */
22              else {
23                  Close(connfd);
24                  FD_CLR(connfd, &p->read_set);
25                  p->clientfd[i] = -1;
26              }
27          }
28      }
29  }
code/conc/echoservers.c





Figure 12.11 check_clients services ready client connections.


In terms of the finite state model in Figure 12.7, the select function detects
input events, and the add_client function creates a new logical flow (state
machine). The check_clients function performs state transitions by echoing input
lines, and it also deletes the state machine when the client has finished sending
text lines.





In the server in Figure 12.8, we are careful to reinitialize the pool.ready_set
variable immediately before every call to select. Why?














Aside  Event-driven Web servers

Despite the disadvantages outlined in Section 12.2.2, modern high-performance servers such as Node.js,
nginx, and Tornado use event-driven programming based on I/O multiplexing, mainly because of the
significant performance advantage compared to processes and threads.
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J 
12.2.2 Pros and Cons of I/O Multiplexing


The server in Figure 12.8 provides a nice example of the advantages and
disadvantages of event-driven programming based on I/O multiplexing. One advantage
is that event-driven designs give programmers more control over the behavior of
their programs than process-based designs. For example, we can imagine
writing an event-driven concurrent server that gives preferred service to some clients,
which would be difficult for a concurrent server based on processes.

Another advantage is that an event-driven server based on I/O multiplexing 
runs in the context of a single process, and thus every logical flow has access to 
the entire address space of the process. This makes it easy to share data between 
flows. A related advantage of running as a single process is that you can debug
your concurrent server as you would any sequential program, using a familiar
debugging tool such as GDB. Finally, event-driven designs are often significantly
more efficient than process-based designs because they do not require a process
context switch to schedule a new flow.

A significant disadvantage of event-driven designs is coding complexity. Our
event-driven concurrent echo server requires three times more code than the
process-based server. Unfortunately, the complexity increases as the granularity
of the concurrency decreases. By granularity, we mean the number of instructions
that each logical flow executes per time slice. For instance, in our example
concurrent server, the granularity of concurrency is the number of instructions required
to read an entire text line. As long as some logical flow is busy reading a text line,
no other logical flow can make progress. This is fine for our example, but it makes 
our event-driven server vulnerable to a malicious client that sends only a partial 
text line and then halts. Modifying an event-driven server to handle partial text 
lines is a nontrivial task, but it is handled cleanly and automatically by a process- 
based design. Another significant disadvantage of event-based designs is that they 
cannot fully utilize multi-core processors. 
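To see what handling partial text lines involves, consider the bookkeeping alone. The following is a minimal sketch of a per-client buffer that accumulates whatever bytes have arrived and releases a line only once its newline has been seen; the type linebuf and the functions linebuf_init, linebuf_feed, and linebuf_take_line are hypothetical, and the sketch deliberately leaves out the nonblocking reads that a real server would use to obtain the bytes.

```c
#include <stddef.h>
#include <string.h>

#define LINEBUF_SIZE 1024

typedef struct {
    char data[LINEBUF_SIZE]; /* Bytes received but not yet consumed */
    size_t len;              /* Number of valid bytes in data */
} linebuf;

void linebuf_init(linebuf *lb)
{
    lb->len = 0;
}

/* Append up to n newly arrived bytes. Returns the number accepted,
   which is less than n if the buffer fills up (a caller would likely
   drop a client that sends an overlong line). */
size_t linebuf_feed(linebuf *lb, const char *bytes, size_t n)
{
    size_t room = LINEBUF_SIZE - lb->len;
    if (n > room)
        n = room;
    memcpy(lb->data + lb->len, bytes, n);
    lb->len += n;
    return n;
}

/* If a complete line is buffered, copy it (including the newline and a
   terminating NUL) into out, remove it from the buffer, and return its
   length. Otherwise return 0, so the flow simply waits for more input
   without holding up any other flow. */
size_t linebuf_take_line(linebuf *lb, char *out, size_t outsize)
{
    char *nl = memchr(lb->data, '\n', lb->len);
    size_t linelen;

    if (nl == NULL)
        return 0;
    linelen = (size_t)(nl - lb->data) + 1;
    if (linelen + 1 > outsize)
        return 0;                    /* Caller's buffer is too small */
    memcpy(out, lb->data, linelen);
    out[linelen] = '\0';
    memmove(lb->data, lb->data + linelen, lb->len - linelen);
    lb->len -= linelen;
    return linelen;
}
```

Feeding the fragments "hel", "lo\nwor", and "ld\n" yields the lines "hello\n" and "world\n", with linebuf_take_line returning 0 whenever only a partial line is buffered. Each client needs its own buffer and the server loop must consult it on every readiness event, which is the kind of extra state that makes fine-grained event-driven code longer than its process-based counterpart.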




12.3 Concurrent Programming with Threads 


To this point, we have looked at two approaches for creating concurrent logical
flows. With the first approach, we use a separate process for each flow. The kernel
schedules each process automatically, and each process has its own private address
space, which makes it difficult for flows to share data. With the second approach,
we create our own logical flows and use I/O multiplexing to explicitly schedule
the flows. Because there is only one process, flows share the entire address space.
This section introduces a third approach, based on threads, that is a hybrid of
these two.

A thread is a logical flow that runs in the context of a process. Thus far 
in this book, our programs have consisted of a single thread per process. But 
modern systems also allow us to write programs that have multiple threads running 
concurrently in a single process. The threads are scheduled automatically by the 
kernel. Each thread has its own thread context, including a unique integer thread 
ID (TID), stack, stack pointer, program counter, general-purpose registers, and 
condition codes. All threads running in a process share the entire virtual address 
space of that process. 

Logical flows based on threads combine qualities of flows based on processes 
and I/O multiplexing. Like processes, threads are scheduled automatically by the 
kernel and are known to the kernel by an integer ID. Like flows based on I/O 
multiplexing, multiple threads run in the context of a single process, and thus they 
share the entire contents of the process virtual address space, including its code, 
data, heap, shared libraries, and open files. 


12.3.1 Thread Execution Model 


The execution model for multiple threads is similar in some ways to the execution 
model for multiple processes. Consider the example in Figure 12.12. Each process 
begins life as a single thread called the main thread. At some point, the main thread 
creates a peer thread, and from this point in time the two threads run concurrently. 
Eventually, control passes to the peer thread via a context switch, either because 
the main thread executes a slow system call such as read or sleep or because it 
is interrupted by the system's interval timer. The peer thread executes for a while 
before control passes back to the main thread, and so on. 

Thread execution differs from processes in some important ways. Because a 
thread context is much smaller than a process context, a thread context switch is 
faster than a process context switch. Another difference is that threads, unlike pro- 
cesses, are not organized in a rigid parent-child hierarchy. The threads associated 


Figure 12.12 Concurrent thread execution. As time advances, control alternates between
Thread 1 (main thread) and Thread 2 (peer thread), with a thread context switch at each handoff.
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with a process form a pool of peers, independent of which threads were created 


by which other threads. The main thread is distinguished from other threads only
in the sense that it is always the first thread to run in the process. The main impact
of this notion of a pool of peers is that a thread can kill any of its peers or wait
for any of its peers to terminate. Further, each peer can read and write the same
shared data.
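The last point can be checked with a very small program. The following is a minimal sketch that relies on the pthread_create and pthread_join functions described later in this section; the names shared, writer, and shared_after_join are hypothetical. A peer thread writes a global variable, and the main thread sees the new value because both threads run in the same address space.

```c
#include <pthread.h>

static int shared = 0; /* Global data: visible to every thread */

/* Peer thread routine: write the shared variable */
static void *writer(void *vargp)
{
    (void)vargp;
    shared = 42;
    return NULL;
}

/* Returns the value the main thread sees after the peer has finished,
   or -1 on error. */
int shared_after_join(void)
{
    pthread_t tid;

    shared = 0;
    if (pthread_create(&tid, NULL, writer, NULL) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return shared; /* The join guarantees the write has completed */
}
```

The function should return 42. No explicit communication channel is needed, but the convenience has a cost: once several threads write the same data at the same time, their accesses must be synchronized.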


12.3.2 Posix Threads


Posix threads (Pthreads) is a standard interface for manipulating threads from C
programs. It was adopted in 1995 and is available on all Linux systems. Pthreads
defines about 60 functions that allow programs to create, kill, and reap threads,
to share data safely with peer threads, and to notify peers about changes in the
system state.
Figure 12.13 shows a simple Pthreads program. The main thread creates a peer
thread and then waits for it to terminate. The peer thread prints Hello, world!\n
and terminates. When the main thread detects that the peer thread has terminated,
it terminates the process by calling exit. This is the first threaded program we
have seen, so let us dissect it carefully. The code and local data for a thread are
encapsulated in a thread routine. As shown by the prototype in line 2, each thread
routine takes as input a single generic pointer and returns a generic pointer. If
you want to pass multiple arguments to a thread routine, then you should put the
arguments into a structure and pass a pointer to the structure. Similarly, if you
code/conc/hello.c
 1  #include "csapp.h"
 2  void *thread(void *vargp);
 3
 4  int main()
 5  {
 6      pthread_t tid;
 7      Pthread_create(&tid, NULL, thread, NULL);
 8      Pthread_join(tid, NULL);
 9      exit(0);
10  }
11
12  void *thread(void *vargp) /* Thread routine */
13  {
14      printf("Hello, world!\n");
15      return NULL;
16  }
code/conc/hello.c

Figure 12.13 hello.c: The Pthreads "Hello, world!" program.
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want the thread routine to return multiple arguments, you can return a pointer to 
a structure. 

Line 4 marks the beginning of the code for the main thread. The main thread 
declares a single local variable tid, which will be used to store the thread ID of 
the peer thread (line 6). The main thread creates a new peer thread by calling the 
pthread_create function (line 7). When the call to pthread_create returns, the 
main thread and the newly created peer thread are running concurrently, and tid 
contains the ID of the new thread. The main thread waits for the peer thread to 
terminate with the call to pthread_join in line 8. Finally, the main thread calls 
exit (line 9), which terminates all threads (in this case, just the main thread) 
currently running in the process. 

Lines 12-16 define the thread routine for the peer thread. It simply prints a 
string and then terminates the peer thread by executing the return statement in 
line 15. 
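The advice about passing and returning structures is worth seeing in code, since hello.c passes NULL in both directions. The following is a minimal sketch using the raw Pthreads calls rather than the csapp.h wrappers; the structures pair_args and pair_result and the functions pair_thread and run_pair are hypothetical. The result is allocated on the heap because the thread's own stack disappears when the thread terminates.

```c
#include <pthread.h>
#include <stdlib.h>

struct pair_args {   /* Hypothetical bundle of input arguments */
    int a;
    int b;
};

struct pair_result { /* Hypothetical bundle of return values */
    int sum;
    int product;
};

/* Thread routine: unpack the arguments, return a heap-allocated result */
static void *pair_thread(void *vargp)
{
    struct pair_args *args = vargp;
    struct pair_result *res = malloc(sizeof(*res));

    if (res == NULL)
        return NULL;
    res->sum = args->a + args->b;
    res->product = args->a * args->b;
    return res;
}

/* Runs pair_thread on (a, b) and copies out its results. Returns 0 if OK. */
int run_pair(int a, int b, int *sum, int *product)
{
    pthread_t tid;
    struct pair_args args = { a, b }; /* Valid until we return; we join first */
    struct pair_result *res;
    void *ret;

    if (pthread_create(&tid, NULL, pair_thread, &args) != 0)
        return -1;
    if (pthread_join(tid, &ret) != 0)
        return -1;
    if (ret == NULL)
        return -1;
    res = ret;
    *sum = res->sum;
    *product = res->product;
    free(res);                        /* The joiner owns the result */
    return 0;
}
```

Calling run_pair(3, 4, &s, &p) should leave 7 in s and 12 in p. Passing the address of a local args structure is safe here only because the creating function joins the thread before returning.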


12.3.3 Creating Threads 


Threads create other threads by calling the pthread_create function. 


#include <pthread.h>
typedef void *(func)(void *);

int pthread_create(pthread_t *tid, pthread_attr_t *attr,
                   func *f, void *arg);

        Returns: 0 if OK, nonzero on error





The pthread_create function creates a new thread and runs the thread routine f
in the context of the new thread and with an input argument of arg. The attr
argument can be used to change the default attributes of the newly created thread.
Changing these attributes is beyond our scope, and in our examples, we will always
call pthread_create with a NULL attr argument.

When pthread_create returns, argument tid contains the ID of the newly
created thread. The new thread can determine its own thread ID by calling the
pthread_self function.








#include <pthread.h> 


pthread_t pthread_self (void); 


Returns: thread ID of caller 
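
As a small check of the relationship between the two functions, the following sketch has a peer thread record its own ID so that the creator can compare it with the ID stored by pthread_create. The names record_self and ids_match are hypothetical. The comparison uses the standard pthread_equal function, since pthread_t is an opaque type that should not be compared with ==.

```c
#include <pthread.h>

/* Thread routine: stores the caller's own ID in the slot supplied by the creator. */
static void *record_self(void *vargp)
{
    pthread_t *slot = vargp;
    *slot = pthread_self();
    return NULL;
}

/* Returns 1 if the ID reported by pthread_create matches the ID the peer
   obtained from pthread_self, 0 if not, and -1 on error. */
int ids_match(void)
{
    pthread_t tid, seen;

    if (pthread_create(&tid, NULL, record_self, &seen) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return pthread_equal(tid, seen) ? 1 : 0;
}
```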





12.3.4 Terminating Threads 
A thread terminates in one of the following ways: 


* The thread terminates implicitly when its top-level thread routine returns. 





* The thread terminates explicitly by calling the pthread_exit function. If
  the main thread calls pthread_exit, it waits for all other peer threads to
  terminate and then terminates the main thread and the entire process with
  a return value of thread_return.

#include <pthread.h>

void pthread_exit(void *thread_return);

Never returns

* Some peer thread calls the Linux exit function, which terminates the process
  and all threads associated with the process.

* Another peer thread terminates the current thread by calling the
  pthread_cancel function with the ID of the current thread.

#include <pthread.h>

int pthread_cancel(pthread_t tid);

Returns: 0 if OK, nonzero on error
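
The first two ways of terminating can be contrasted in a short sketch. The helper names finish_early, worker, and exit_value, and the values 7 and 42, are made up for illustration. The point is that pthread_exit ends the calling thread from any depth in its call chain, while a return ends it only from the top-level thread routine. In both cases the value reaches the thread that joins.

```c
#include <pthread.h>
#include <stdint.h>

/* Helper called from the thread routine. It ends the calling thread
   directly, so control never returns to worker(). */
static void finish_early(long code)
{
    pthread_exit((void *)(intptr_t)code);
}

static void *worker(void *vargp)
{
    long mode = (long)(intptr_t)vargp;

    if (mode == 1)
        finish_early(42);          /* Explicit termination */
    return (void *)(intptr_t)7;    /* Implicit termination */
}

/* Runs worker in the given mode and returns the value delivered to
   pthread_join, or -1 on error. */
long exit_value(long mode)
{
    pthread_t tid;
    void *ret;

    if (pthread_create(&tid, NULL, worker, (void *)(intptr_t)mode) != 0)
        return -1;
    if (pthread_join(tid, &ret) != 0)
        return -1;
    return (long)(intptr_t)ret;
}
```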


12.3.5 Reaping Terminated Threads 


Threads wait for other threads to terminate by calling the pthread_join function. 


#include <pthread.h>

int pthread_join(pthread_t tid, void **thread_return);

Returns: 0 if OK, nonzero on error





The pthread_join function blocks until thread tid terminates, assigns the generic
(void *) pointer returned by the thread routine to the location pointed to by
thread_return, and then reaps any memory resources held by the terminated
thread.

Notice that, unlike the Linux wait function, the pthread_join function can
only wait for a specific thread to terminate. There is no way to instruct
pthread_join to wait for an arbitrary thread to terminate. This can complicate our
code by forcing us to use other, less intuitive mechanisms to detect process
termination. Indeed, Stevens argues convincingly that this is a bug in the
specification [110].

12.3.6 Detaching Threads

At any point in time, a thread is joinable or detached. A joinable thread can be
reaped and killed by other threads. Its memory resources (such as the stack) are
not freed until it is reaped by another thread. In contrast, a detached thread cannot
be reaped or killed by other threads. Its memory resources are freed automatically
by the system when it terminates.

By default, threads are created joinable. In order to avoid memory leaks, each 
joinable thread should be either explicitly reaped by another thread or detached 
by a call to the pthread_detach function. 


#include <pthread.h> 


int pthread_detach(pthread_t tid); 





Returns: 0 if OK, nonzero on error 


The pthread_detach function detaches the joinable thread tid. Threads can
detach themselves by calling pthread_detach with an argument of
pthread_self().

Although some of our examples will use joinable threads, there are good
reasons to use detached threads in real programs. For example, a high-performance
Web server might create a new peer thread each time it receives a connection
request from a Web browser. Since each connection is handled independently by a
separate thread, it is unnecessary, and indeed undesirable, for the server to
explicitly wait for each peer thread to terminate. In this case, each peer thread
should detach itself before it begins processing the request so that its memory
resources can be reclaimed after it terminates.
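
A thread that detaches itself can be sketched as follows. Because a detached thread cannot be joined, the sketch needs some other way to learn that the peer has finished. As an assumption made only for this example, it uses a Posix semaphore (semaphores are covered in Section 12.5). The names handle_request, run_detached, done, and handled are hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t done;            /* Posted by the peer when its work is finished */
static volatile int handled;  /* Stands in for "request processed" */

static void *handle_request(void *vargp)
{
    (void)vargp;
    pthread_detach(pthread_self());  /* No other thread will join this one */
    handled = 1;
    sem_post(&done);
    return NULL;
}

/* Returns 1 if the detached peer ran to completion, -1 on error. */
int run_detached(void)
{
    pthread_t tid;

    handled = 0;
    if (sem_init(&done, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, handle_request, NULL) != 0)
        return -1;
    sem_wait(&done);   /* Waits for the notification, not for the thread */
    return handled;
}
```

A real server would not wait at all. The wait is here only so that the example has an observable result.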


12.3.7 Initializing Threads 


The pthread_once function allows you to initialize the state associated with a
thread routine.

#include <pthread.h>

pthread_once_t once_control = PTHREAD_ONCE_INIT;

int pthread_once(pthread_once_t *once_control,
                 void (*init_routine)(void));

Always returns 0





The once_control variable is a global or static variable that is always initialized
to PTHREAD_ONCE_INIT. The first time you call pthread_once with an
argument of once_control, it invokes init_routine, which is a function with no
input arguments that returns nothing. Subsequent calls to pthread_once with the
same once_control variable do nothing. The pthread_once function is useful
whenever you need to dynamically initialize global variables that are shared by
multiple threads. We will look at an example in Section 12.5.5.
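
The once-only behavior can be observed with a small sketch in which several peers all call pthread_once on the same control variable. The counter init_calls, the constant NTHREADS, and the function names are hypothetical and exist only to make the effect visible.

```c
#include <pthread.h>

#define NTHREADS 4

static pthread_once_t once_control = PTHREAD_ONCE_INIT;
static int init_calls = 0;   /* Number of times init_routine has run */

static void init_routine(void)
{
    init_calls++;            /* Runs once, no matter how many peers call */
}

static void *worker(void *vargp)
{
    (void)vargp;
    pthread_once(&once_control, init_routine);
    return NULL;
}

/* Starts NTHREADS peers that all call pthread_once and returns how many
   times the initialization routine actually ran, or -1 on error. */
int count_initializations(void)
{
    pthread_t tid[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0)
            return -1;
    for (i = 0; i < NTHREADS; i++)
        if (pthread_join(tid[i], NULL) != 0)
            return -1;
    return init_calls;
}
```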
12.3.8 A Concurrent Server Based on Threads


Figure 12.14 shows the code for a concurrent echo server based on threads. The
overall structure is similar to the process-based design. The main thread
repeatedly waits for a connection request and then creates a peer thread to handle
the request.

code/conc/echoservert.c
1   #include "csapp.h"
2
3   void echo(int connfd);
4   void *thread(void *vargp);
5
6   int main(int argc, char **argv)
7   {
8       int listenfd, *connfdp;
9       socklen_t clientlen;
10      struct sockaddr_storage clientaddr;
11      pthread_t tid;
12
13      if (argc != 2) {
14          fprintf(stderr, "usage: %s <port>\n", argv[0]);
15          exit(0);
16      }
17      listenfd = Open_listenfd(argv[1]);
18
19      while (1) {
20          clientlen = sizeof(struct sockaddr_storage);
21          connfdp = Malloc(sizeof(int));
22          *connfdp = Accept(listenfd, (SA *) &clientaddr, &clientlen);
23          Pthread_create(&tid, NULL, thread, connfdp);
24      }
25  }
26
27  /* Thread routine */
28  void *thread(void *vargp)
29  {
30      int connfd = *((int *)vargp);
31      Pthread_detach(pthread_self());
32      Free(vargp);
33      echo(connfd);
34      Close(connfd);
35      return NULL;
36  }
code/conc/echoservert.c

Figure 12.14 Concurrent echo server based on threads.

While the code looks simple, there are a couple of general and somewhat subtle
issues we need to look at more closely. The first issue is how to pass the
connected descriptor to the peer thread when we call pthread_create. The
obvious approach is to pass a pointer to the descriptor, as in the following:

connfd = Accept(listenfd, (SA *) &clientaddr, &clientlen);
Pthread_create(&tid, NULL, thread, &connfd);

Then we have the peer thread dereference the pointer and assign it to a local
variable, as follows:

void *thread(void *vargp) {
    int connfd = *((int *)vargp);

}


This would be wrong, however, because it introduces a race between the
assignment statement in the peer thread and the accept statement in the main
thread. If the assignment statement completes before the next accept, then the
local connfd variable in the peer thread gets the correct descriptor value.
However, if the assignment completes after the accept, then the local connfd
variable in the peer thread gets the descriptor number of the next connection.
The unhappy result is that two threads are now performing input and output on the
same descriptor. In order to avoid the potentially deadly race, we must assign
each connected descriptor returned by accept to its own dynamically allocated
memory block, as shown in lines 21-22. We will return to the issue of races in
Section 12.7.4.

Another issue is avoiding memory leaks in the thread routine. Since we are
not explicitly reaping threads, we must detach each thread so that its memory
resources will be reclaimed when it terminates (line 31). Further, we must be
careful to free the memory block that was allocated by the main thread (line 32).
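
The one-block-per-thread pattern can be tried without any networking. In the sketch below, which is an illustration and not part of the echo server, the loop index stands in for the descriptor returned by accept, and the array seen is a hypothetical device for checking that every peer received its own value. Unlike the server, the sketch joins its peers so that the outcome can be inspected.

```c
#include <pthread.h>
#include <stdlib.h>

#define NCONN 8

static int seen[NCONN];   /* seen[v] counts the peers that received value v */

static void *peer(void *vargp)
{
    int v = *(int *)vargp;   /* Copy the value out of the private block */
    free(vargp);             /* The peer owns the block, so the peer frees it */
    seen[v]++;               /* Each peer updates a different element */
    return NULL;
}

/* Hands the values 0..NCONN-1 to NCONN peers, one heap block per peer.
   Returns the number of values received exactly once, or -1 on error. */
int distinct_values_delivered(void)
{
    pthread_t tid[NCONN];
    int i, ok = 0;

    for (i = 0; i < NCONN; i++) {
        int *vp = malloc(sizeof(int));
        if (vp == NULL)
            return -1;
        *vp = i;             /* Stand-in for the value returned by accept */
        if (pthread_create(&tid[i], NULL, peer, vp) != 0)
            return -1;
    }
    for (i = 0; i < NCONN; i++)
        pthread_join(tid[i], NULL);
    for (i = 0; i < NCONN; i++)
        ok += (seen[i] == 1);
    return ok;
}
```

Had the loop passed &i instead of a private block, a peer could read i after the main thread had already advanced it, which is the same race described above.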





In the process-based server in Figure 12.5, we were careful to close the connected
descriptor in two places: the parent process and the child process. However, in the
threads-based server in Figure 12.14, we only closed the connected descriptor in
one place: the peer thread. Why?





12.4 Shared Variables in Threaded Programs 


From a programmer's perspective, one of the attractive aspects of threads is the 
ease with which multiple threads can share the same program variables. However, 
this sharing can be tricky. In order to write correctly threaded programs, we must 
have a clear understanding of what we mean by sharing and how it works. 

There are some basic questions to work through in order to understand
whether a variable in a C program is shared or not: (1) What is the underlying
memory model for threads? (2) Given this model, how are instances of the
variable mapped to memory? (3) Finally, how many threads reference each of these
instances? The variable is shared if and only if multiple threads reference some
instance of the variable.

code/conc/sharing.c
1   #include "csapp.h"
2   #define N 2
3   void *thread(void *vargp);
4
5   char **ptr;  /* Global variable */
6
7   int main()
8   {
9       int i;
10      pthread_t tid;
11      char *msgs[N] = {
12          "Hello from foo",
13          "Hello from bar"
14      };
15
16      ptr = msgs;
17      for (i = 0; i < N; i++)
18          Pthread_create(&tid, NULL, thread, (void *)i);
19      Pthread_exit(NULL);
20  }
21
22  void *thread(void *vargp)
23  {
24      int myid = (int)vargp;
25      static int cnt = 0;
26      printf("[%d]: %s (cnt=%d)\n", myid, ptr[myid], ++cnt);
27      return NULL;
28  }
code/conc/sharing.c

Figure 12.15 Example program that illustrates different aspects of sharing.

To keep our discussion of sharing concrete, we will use the program in
Figure 12.15 as a running example. Although somewhat contrived, it is nonetheless
useful to study because it illustrates a number of subtle points about sharing. The
example program consists of a main thread that creates two peer threads. The
main thread passes a unique ID to each peer thread, which uses the ID to print
a personalized message along with a count of the total number of times that the
thread routine has been invoked.

12.4.1 Threads Memory Model

A pool of concurrent threads runs in the context of a process. Each thread has
its own separate thread context, which includes a thread ID, stack, stack pointer,
program counter, condition codes, and general-purpose register values. Each
thread shares the rest of the process context with the other threads. This includes
the entire user virtual address space, which consists of read-only text (code),
read/write data, the heap, and any shared library code and data areas. The threads
also share the same set of open files.

In an operational sense, it is impossible for one thread to read or write the 
register values of another thread. On the other hand, any thread can access any 
location in the shared virtual memory. If some thread modifies a memory location, 
then every other thread will eventually see the change if it reads that location. 
Thus, registers are never shared, whereas virtual memory is always shared. 

The memory model for the separate thread stacks is not as clean. These
stacks are contained in the stack area of the virtual address space and are usually
accessed independently by their respective threads. We say usually rather than
always, because different thread stacks are not protected from other threads. So
if a thread somehow manages to acquire a pointer to another thread's stack, then
it can read and write any part of that stack. Our example program shows this in
line 26, where the peer threads reference the contents of the main thread's stack
indirectly through the global ptr variable.


12.4.2 Mapping Variables to Memory 


Variables in threaded C programs are mapped to virtual memory according to
their storage classes:

Global variables. A global variable is any variable declared outside of a
    function. At run time, the read/write area of virtual memory contains exactly
    one instance of each global variable that can be referenced by any thread.
    For example, the global ptr variable declared in line 5 has one run-time
    instance in the read/write area of virtual memory. When there is only one
    instance of a variable, we will denote the instance by simply using the
    variable name (in this case, ptr).

Local automatic variables. A local automatic variable is one that is declared
    inside a function without the static attribute. At run time, each thread's
    stack contains its own instances of any local automatic variables. This
    is true even if multiple threads execute the same thread routine. For
    example, there is one instance of the local variable tid, and it resides
    on the stack of the main thread. We will denote this instance as tid.m.
    As another example, there are two instances of the local variable myid,
    one instance on the stack of peer thread 0 and the other on the stack of
    peer thread 1. We will denote these instances as myid.p0 and myid.p1,
    respectively.

Local static variables. A local static variable is one that is declared inside a
    function with the static attribute. As with global variables, the read/write
    area of virtual memory contains exactly one instance of each local static
    variable declared in a program. For example, even though each peer
    thread in our example program declares cnt in line 25, at run time there is
    only one instance of cnt residing in the read/write area of virtual memory.
    Each peer thread reads and writes this instance.

12.4.3 Shared Variables

We say that a variable v is shared if and only if one of its instances is referenced
by more than one thread. For example, variable cnt in our example program is
shared because it has only one run-time instance and this instance is referenced by
both peer threads. On the other hand, myid is not shared, because each of its two
instances is referenced by exactly one thread. However, it is important to realize
that local automatic variables such as msgs can also be shared.
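
These distinctions can be checked with a short sketch. It is not the program of Figure 12.15: the names shared_slot, peer, bump, and sharing_demo are hypothetical, and the two peers are run one after the other so that the result does not depend on how their instructions interleave.

```c
#include <pthread.h>

static int *shared_slot;   /* Global pointer, analogous to ptr in Figure 12.15 */

static void *peer(void *vargp)
{
    static int cnt = 0;    /* One instance, shared by every invocation */
    int bump = 10;         /* One instance per thread, on that thread's stack */

    (void)vargp;
    cnt++;
    *shared_slot += bump;  /* Writes a local variable of the creating thread */
    return (void *)&cnt;   /* Address of the single static instance */
}

/* Runs two peers one after the other and reports what they did.
   Returns 0 on success, -1 on error. */
int sharing_demo(int *local_total, int *static_cnt, int *same_instance)
{
    int total = 0;         /* Local automatic variable of the caller */
    pthread_t tid;
    void *a, *b;

    shared_slot = &total;
    if (pthread_create(&tid, NULL, peer, NULL) != 0 ||
        pthread_join(tid, &a) != 0)
        return -1;
    if (pthread_create(&tid, NULL, peer, NULL) != 0 ||
        pthread_join(tid, &b) != 0)
        return -1;

    *local_total = total;        /* Modified by both peers */
    *static_cnt = *(int *)a;     /* Final value of the static counter */
    *same_instance = (a == b);   /* Both peers used the same cnt */
    return 0;
}
```

Here total plays the role of msgs: it is a local automatic variable, yet both peers reference its single instance through a global pointer, so it is shared.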







A. Fill each entry in the following table with "Yes" or "No" for the example
   program in Figure 12.15. In the first column, the notation v.t denotes an
   instance of variable v residing on the local stack for thread t, where t is
   either m (main thread), p0 (peer thread 0), or p1 (peer thread 1).

   Variable     Referenced by   Referenced by    Referenced by
   instance     main thread?    peer thread 0?   peer thread 1?
   cnt
   i.m
   msgs.m
   myid.p0
   myid.p1

B. Given the analysis in part A, which of the variables ptr, cnt, i, msgs, and
   myid are shared?



12.5 Synchronizing Threads with Semaphores

Shared variables can be convenient, but they introduce the possibility of nasty
synchronization errors. Consider the badcnt.c program in Figure 12.16, which
creates two threads, each of which increments a global shared counter variable
called cnt.

code/conc/badcnt.c
1   /* WARNING: This code is buggy! */
2   #include "csapp.h"
3
4   void *thread(void *vargp);  /* Thread routine prototype */
5
6   /* Global shared variable */
7   volatile long cnt = 0; /* Counter */
8
9   int main(int argc, char **argv)
10  {
11      long niters;
12      pthread_t tid1, tid2;
13
14      /* Check input argument */
15      if (argc != 2) {
16          printf("usage: %s <niters>\n", argv[0]);
17          exit(0);
18      }
19      niters = atoi(argv[1]);
20
21      /* Create threads and wait for them to finish */
22      Pthread_create(&tid1, NULL, thread, &niters);
23      Pthread_create(&tid2, NULL, thread, &niters);
24      Pthread_join(tid1, NULL);
25      Pthread_join(tid2, NULL);
26
27      /* Check result */
28      if (cnt != (2 * niters))
29          printf("BOOM! cnt=%ld\n", cnt);
30      else
31          printf("OK cnt=%ld\n", cnt);
32      exit(0);
33  }
34
35  /* Thread routine */
36  void *thread(void *vargp)
37  {
38      long i, niters = *((long *)vargp);
39
40      for (i = 0; i < niters; i++)
41          cnt++;
42
43      return NULL;
44  }
code/conc/badcnt.c

Figure 12.16 badcnt.c: An improperly synchronized counter program.

Since each thread increments the counter niters times, we expect its final
value to be 2 x niters. This seems quite simple and straightforward. However,
when we run badcnt.c on our Linux system, we not only get wrong answers, we
get different answers each time!

linux> ./badcnt 1000000
BOOM! cnt=1445085

linux> ./badcnt 1000000
BOOM! cnt=1915220

linux> ./badcnt 1000000
BOOM! cnt=1404746



So what went wrong? To understand the problem clearly, we need to study
the assembly code for the counter loop (lines 40-41), as shown in Figure 12.17.
We will find it helpful to partition the loop code for thread i into five parts:

H_i: The block of instructions at the head of the loop

L_i: The instruction that loads the shared variable cnt into the accumulator
     register %rdx_i, where %rdx_i denotes the value of register %rdx in thread i

U_i: The instruction that updates (increments) %rdx_i

S_i: The instruction that stores the updated value of %rdx_i back to the shared
     variable cnt

T_i: The block of instructions at the tail of the loop

Notice that the head and tail manipulate only local stack variables, while L_i, U_i,
and S_i manipulate the contents of the shared counter variable.

When the two peer threads in badcnt.c run concurrently on a uniprocessor,
the machine instructions are completed one after the other in some order. Thus,
each concurrent execution defines some total ordering (or interleaving) of the
instructions in the two threads. Unfortunately, some of these orderings will produce
correct results, but others will not.














C code for thread i               Asm code for thread i

for (i = 0; i < niters; i++)          movq   (%rdi), %rcx
    cnt++;                            testq  %rcx, %rcx         H_i: Head
                                      jle    .L2
                                      movl   $0, %eax
                                  .L3:
                                      movq   cnt(%rip), %rdx    L_i: Load cnt
                                      addq   $1, %rdx           U_i: Update cnt
                                      movq   %rdx, cnt(%rip)    S_i: Store cnt
                                      addq   $1, %rax
                                      cmpq   %rcx, %rax         T_i: Tail
                                      jne    .L3
                                  .L2:

Figure 12.17 Assembly code for the counter loop (lines 40-41) in badcnt.c.

(a) Correct ordering

Step   Thread   Instr.   %rdx_1   %rdx_2   cnt
 1       1      H_1        -        -       0
 2       1      L_1        0        -       0
 3       1      U_1        1        -       0
 4       1      S_1        1        -       1
 5       2      H_2        -        -       1
 6       2      L_2        -        1       1
 7       2      U_2        -        2       1
 8       2      S_2        -        2       2
 9       2      T_2        -        2       2
10       1      T_1        1        -       2

(b) Incorrect ordering

Step   Thread   Instr.   %rdx_1   %rdx_2   cnt
 1       1      H_1        -        -       0
 2       1      L_1        0        -       0
 3       1      U_1        1        -       0
 4       2      H_2        -        -       0
 5       2      L_2        -        0       0
 6       1      S_1        1        -       1
 7       1      T_1        1        -       1
 8       2      U_2        -        1       1
 9       2      S_2        -        1       1
10       2      T_2        -        1       1

Figure 12.18 Instruction orderings for the first loop iteration in badcnt.c.


Here is the crucial point: In general, there is no way for you to predict whether
the operating system will choose a correct ordering for your threads. For example,
Figure 12.18(a) shows the step-by-step operation of a correct instruction ordering.
After each thread has updated the shared variable cnt, its value in memory is 2,
which is the expected result.

On the other hand, the ordering in Figure 12.18(b) produces an incorrect value
for cnt. The problem occurs because thread 2 loads cnt in step 5, after thread 1
loads cnt in step 2 but before thread 1 stores its updated value in step 6. Thus, each
thread ends up storing an updated counter value of 1. We can clarify these notions
of correct and incorrect instruction orderings with the help of a device known as
a progress graph, which we introduce in the next section.
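
The two orderings of Figure 12.18 can be replayed deterministically in a small single-threaded simulation. This is a sketch under simplifying assumptions: each of the five parts is treated as a single step, and the names replay, correct_order, and incorrect_order are introduced only for the example.

```c
#include <stddef.h>

enum instr { H, L, U, S, T };                 /* The five parts of the loop body */
struct step { int thread; enum instr op; };   /* thread is 1 or 2 */

/* Replays one ordering of the first loop iteration and returns the final
   value of the simulated shared counter. */
long replay(const struct step *order, size_t n)
{
    long cnt = 0;               /* Simulated shared variable */
    long rdx[3] = { 0, 0, 0 };  /* rdx[i] models register %rdx in thread i */

    for (size_t k = 0; k < n; k++) {
        int i = order[k].thread;
        switch (order[k].op) {
        case L: rdx[i] = cnt; break;   /* Load cnt */
        case U: rdx[i] += 1;  break;   /* Update the private copy */
        case S: cnt = rdx[i]; break;   /* Store it back */
        case H:
        case T: break;                 /* Head and tail do not touch cnt */
        }
    }
    return cnt;
}

/* Orderings (a) and (b) from Figure 12.18. */
const struct step correct_order[10] = {
    {1,H},{1,L},{1,U},{1,S},{2,H},{2,L},{2,U},{2,S},{2,T},{1,T}
};
const struct step incorrect_order[10] = {
    {1,H},{1,L},{1,U},{2,H},{2,L},{1,S},{1,T},{2,U},{2,S},{2,T}
};
```

Replaying correct_order leaves the simulated counter at 2, while incorrect_order leaves it at 1, matching the last rows of the two tables.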








Step   Thread   Instr.   %rdx_1   %rdx_2   cnt
 1       1      H_1        -        -       0
 2       1      L_1
 3       2      H_2
 4       2      L_2
 5       2      U_2
 6       2      S_2
 7       1      U_1
 8       1      S_1
 9       1      T_1
10       2      T_2

Does this ordering result in a correct value for cnt?



12.5.1 Progress Graphs 


A progress graph models the execution of n concurrent threads as a trajectory
through an n-dimensional Cartesian space. Each axis k corresponds to the progress
of thread k. Each point (I_1, I_2, ..., I_n) represents the state where thread k
(k = 1, ..., n) has completed instruction I_k. The origin of the graph corresponds
to the initial state where none of the threads has yet completed an instruction.

Figure 12.19 shows the two-dimensional progress graph for the first loop
iteration of the badcnt.c program. The horizontal axis corresponds to thread 1,
the vertical axis to thread 2. Point (L_1, S_2) corresponds to the state where thread
1 has completed L_1 and thread 2 has completed S_2.

A progress graph models instruction execution as a transition from one state
to another. A transition is represented as a directed edge from one point to an
adjacent point. Legal transitions move to the right (an instruction in thread 1
completes) or up (an instruction in thread 2 completes). Two instructions cannot
complete at the same time; diagonal transitions are not allowed. Programs never
run backward, so transitions that move down or to the left are not legal either.


Figure 12.19 Progress graph for the first loop iteration of badcnt.c.
(Horizontal axis: Thread 1. Vertical axis: Thread 2.)

Figure 12.20 An example trajectory.
(Horizontal axis: Thread 1. Vertical axis: Thread 2.)

The execution history of a program is modeled as a trajectory through the
state space. Figure 12.20 shows the trajectory that corresponds to the following
instruction ordering:

H_1, L_1, U_1, H_2, L_2, S_1, T_1, U_2, S_2, T_2



For thread i, the instructions (L_i, U_i, S_i) that manipulate the contents of the
shared variable cnt constitute a critical section (with respect to shared variable
cnt) that should not be interleaved with the critical section of the other thread. In
other words, we want to ensure that each thread has mutually exclusive access to
the shared variable while it is executing the instructions in its critical section. The
phenomenon in general is known as mutual exclusion.

On the progress graph, the intersection of the two critical sections defines
a region of the state space known as an unsafe region. Figure 12.21 shows the
unsafe region for the variable cnt. Notice that the unsafe region abuts, but does
not include, the states along its perimeter. For example, states (H_1, H_2) and
(S_1, U_2) abut the unsafe region, but they are not part of it. A trajectory that
skirts the unsafe region is known as a safe trajectory. Conversely, a trajectory that
touches any part of the unsafe region is an unsafe trajectory. Figure 12.21 shows
examples of safe and unsafe trajectories through the state space of our example
badcnt.c program. The upper trajectory skirts the unsafe region along its left and
top sides, and thus is safe. The lower trajectory crosses the unsafe region, and thus
is unsafe.
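
The definition of a safe trajectory can be turned into a mechanical check. The sketch below rests on one reading of the unsafe region that follows from the perimeter examples above: a state lies inside the region exactly when each thread has completed its L or U instruction but not yet its S instruction. The names in_critical, is_safe, fig_12_20, and serialized are hypothetical.

```c
#include <stddef.h>

enum instr { NONE, H, L, U, S, T };           /* NONE: nothing completed yet */
struct step { int thread; enum instr op; };   /* thread is 1 or 2 */

/* A thread is inside its critical section once it has completed L or U
   but has not yet completed S. */
static int in_critical(enum instr last)
{
    return last == L || last == U;
}

/* Returns 1 if the trajectory never enters the unsafe region, 0 otherwise. */
int is_safe(const struct step *traj, size_t n)
{
    enum instr last[3] = { NONE, NONE, NONE };  /* last[i]: progress of thread i */

    for (size_t k = 0; k < n; k++) {
        last[traj[k].thread] = traj[k].op;
        if (in_critical(last[1]) && in_critical(last[2]))
            return 0;   /* Both critical sections are in progress */
    }
    return 1;
}

/* The instruction ordering of Figure 12.20. */
const struct step fig_12_20[10] = {
    {1,H},{1,L},{1,U},{2,H},{2,L},{1,S},{1,T},{2,U},{2,S},{2,T}
};
/* A serialized ordering: thread 1 finishes before thread 2 starts. */
const struct step serialized[10] = {
    {1,H},{1,L},{1,U},{1,S},{1,T},{2,H},{2,L},{2,U},{2,S},{2,T}
};
```

Under this rule the ordering of Figure 12.20 is classified as unsafe, because it reaches the state (U_1, L_2), while the serialized ordering is safe.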

Any safe trajectory will correctly update the shared counter. In order to
guarantee correct execution of our example threaded program, and indeed any
concurrent program that shares global data structures, we must somehow
synchronize the threads so that they always have a safe trajectory. A classic
approach is based on the idea of a semaphore, which we introduce next.

Figure 12.21 Safe and unsafe trajectories. The intersection of the critical
regions forms an unsafe region. Trajectories that skirt the unsafe region
correctly update the counter variable.



Using the progress graph in Figure 12.21, classify the following trajectories as
either safe or unsafe.

A. H_1, L_1, U_1, S_1, H_2, L_2, U_2, S_2, T_2, T_1
B. H_2, L_2, H_1, L_1, U_1, S_1, T_1, U_2, S_2, T_2
C. H_1, H_2, L_2, U_2, S_2, L_1, U_1, S_1, T_1, T_2



12.5.2 Semaphores

Edsger Dijkstra, a pioneer of concurrent programming, proposed a classic solution
to the problem of synchronizing different execution threads based on a special
type of variable called a semaphore. A semaphore, s, is a global variable with a
nonnegative integer value that can only be manipulated by two special operations,
called P and V:

P(s): If s is nonzero, then P decrements s and returns immediately. If s is
      zero, then suspend the thread until s becomes nonzero and the thread is
      restarted by a V operation. After restarting, the P operation decrements
      s and returns control to the caller.

V(s): The V operation increments s by 1. If there are any threads blocked at a P
      operation waiting for s to become nonzero, then the V operation restarts
      exactly one of these threads, which then completes its P operation by
      decrementing s.

The test and decrement operations in P occur indivisibly, in the sense that
once the semaphore s becomes nonzero, the decrement of s occurs without
interruption. The increment operation in V also occurs indivisibly, in that it loads,
increments, and stores the semaphore without interruption. Notice that the
definition of V does not define the order in which waiting threads are restarted. The
only requirement is that the V must restart exactly one waiting thread. Thus, when
several threads are waiting at a semaphore, you cannot predict which one will be
restarted as a result of the V.

The definitions of P and V ensure that a running program can never enter a 
state where a properly initialized semaphore has a negative value. This property, 
known as the semaphore invariant, provides a powerful tool for controlling the 
trajectories of concurrent programs, as we shall see in the next section. 

The Posix standard defines a variety of functions for manipulating sema- 
phores. 


#include <semaphore.h>

int sem_init(sem_t *sem, 0, unsigned int value);
int sem_wait(sem_t *s);  /* P(s) */
int sem_post(sem_t *s);  /* V(s) */

Returns: 0 if OK, -1 on error





The sem_init function initializes semaphore sem to value. Each semaphore must
be initialized before it can be used. For our purposes, the middle argument is
always 0. Programs perform P and V operations by calling the sem_wait and
sem_post functions, respectively. For conciseness, we prefer to use the following
equivalent P and V wrapper functions instead:

#include "csapp.h"

void P(sem_t *s);  /* Wrapper function for sem_wait */
void V(sem_t *s);  /* Wrapper function for sem_post */

Returns: nothing
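
One plausible way to write such wrappers is sketched below. This is not the csapp.c source: the choice to print a message with perror and then exit is an assumption about how an error-checking wrapper might behave, and pv_demo is a hypothetical helper that uses the Posix sem_getvalue function to observe the semaphore after each operation.

```c
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>

/* Wrapper for sem_wait: performs P(s) and exits on error. */
void P(sem_t *s)
{
    if (sem_wait(s) < 0) {
        perror("P error");
        exit(1);
    }
}

/* Wrapper for sem_post: performs V(s) and exits on error. */
void V(sem_t *s)
{
    if (sem_post(s) < 0) {
        perror("V error");
        exit(1);
    }
}

/* Initializes a semaphore to 1, applies P and then V, and records the
   value after each step. Returns 0 on success, -1 on error. */
int pv_demo(int *after_p, int *after_v)
{
    sem_t s;

    if (sem_init(&s, 0, 1) < 0)
        return -1;
    P(&s);
    if (sem_getvalue(&s, after_p) < 0)
        return -1;
    V(&s);
    if (sem_getvalue(&s, after_v) < 0)
        return -1;
    sem_destroy(&s);
    return 0;
}
```

With an initial value of 1, the semaphore reads 0 after the P and 1 again after the V. The sketch assumes a Linux system, where unnamed semaphores created with sem_init are supported.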





12.5.3 Using Semaphores for Mutual Exclusion

Semaphores provide a convenient way to ensure mutually exclusive access to
shared variables. The basic idea is to associate a semaphore s, initially 1, with
each shared variable (or related set of shared variables) and then surround the
corresponding critical section with P(s) and V(s) operations.

Figure 12.22 Using semaphores for mutual exclusion. The infeasible states where
s < 0 define a forbidden region that surrounds the unsafe region and prevents any
feasible trajectory from touching the unsafe region.

A semaphore that is used in this way to protect shared variables is called a
binary semaphore because its value is always 0 or 1. Binary semaphores whose
purpose is to provide mutual exclusion are often called mutexes. Performing a
P operation on a mutex is called locking the mutex. Similarly, performing the
V operation is called unlocking the mutex. A thread that has locked but not yet
unlocked a mutex is said to be holding the mutex. A semaphore that is used as a
counter for a set of available resources is called a counting semaphore.

The progress graph in Figure 12.22 shows how we would use binary
semaphores to properly synchronize our example counter program.

Each state is labeled with the value of semaphore s in that state. The crucial
idea is that this combination of P and V operations creates a collection of states,
called a forbidden region, where s < 0. Because of the semaphore invariant, no
feasible trajectory can include one of the states in the forbidden region. And since
the forbidden region completely encloses the unsafe region, no feasible trajectory
can touch any part of the unsafe region. Thus, every feasible trajectory is safe, and
regardless of the ordering of the instructions at run time, the program correctly
increments the counter.

In an operational sense, the forbidden region created by the P and V
operations makes it impossible for multiple threads to be executing instructions in
the enclosed critical region at any point in time. In other words, the semaphore
operations ensure mutually exclusive access to the critical region.

Putting it all together, to properly synchronize the example counter program
in Figure 12.16 using semaphores, we first declare a semaphore called mutex:

volatile long cnt = 0;  /* Counter */
sem_t mutex;            /* Semaphore that protects counter */

and then we initialize it to unity in the main routine:

Sem_init(&mutex, 0, 1);  /* mutex = 1 */

Finally, we protect the update of the shared cnt variable in the thread routine by
surrounding it with P and V operations:

for (i = 0; i < niters; i++) {
    P(&mutex);
    cnt++;
    V(&mutex);
}

When we run the properly synchronized program, it now produces the correct
answer each time.

linux> ./goodcnt 1000000
OK cnt=2000000

linux> ./goodcnt 1000000
OK cnt=2000000

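
For readers who want to try the fragments above as one unit, here is a self-contained sketch of the synchronized counter. It is not the goodcnt.c source: it calls sem_wait and sem_post directly instead of the P and V wrappers so that it does not depend on csapp.h, and the names count_up and run_goodcnt are hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>

static volatile long cnt;   /* Shared counter */
static sem_t mutex;         /* Binary semaphore that protects cnt */

static void *count_up(void *vargp)
{
    long i, niters = *(long *)vargp;

    for (i = 0; i < niters; i++) {
        sem_wait(&mutex);   /* P(&mutex) */
        cnt++;
        sem_post(&mutex);   /* V(&mutex) */
    }
    return NULL;
}

/* Runs two peers that each increment cnt niters times and returns the
   final count, or -1 on error. */
long run_goodcnt(long niters)
{
    pthread_t tid1, tid2;

    cnt = 0;
    if (sem_init(&mutex, 0, 1) != 0)
        return -1;
    if (pthread_create(&tid1, NULL, count_up, &niters) != 0)
        return -1;
    if (pthread_create(&tid2, NULL, count_up, &niters) != 0)
        return -1;
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    sem_destroy(&mutex);
    return cnt;
}
```

Removing the sem_wait and sem_post calls turns this back into the unsynchronized loop of Figure 12.16.
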
12.5.4 Using Semaphores to Schedule Shared Resources 


Another important use of semaphores, besides providing mutual exclusion, is to
schedule accesses to shared resources. In this scenario, a thread uses a semaphore
operation to notify another thread that some condition in the program state has
become true. Two classical and useful examples are the producer-consumer and
readers-writers problems.

Figure 12.23 Producer-consumer problem. The producer generates items and inserts
them into a bounded buffer. The consumer removes items from the buffer and then
consumes them.


Producer-Consumer Problem 


The producer-consumer' problem is shown in Figure 12.23.‘A produéet and'con- 
sumer thread share a bounded buffer with n slots. The producer thread repeatedly 
produces hew items and inserts them in the buffer! The consumer thread’ repeat: 
edly removes items from the buffer and'tlien consumés (usés) thèm. Variants ith 
niultiplé producers and consumers are also possible, "I 

Since inserting and removing items involves updating shared variables, we
must guarantee mutually exclusive access to the buffer. But guaranteeing mutual
exclusion is not sufficient. We also need to schedule accesses to the buffer. If the
buffer is full (there are no empty slots), then the producer must wait until a slot
becomes available. Similarly, if the buffer is empty (there are no available items),
then the consumer must wait until an item becomes available.

Producer-consumer interactions occur frequently in real systems. For example,
in a multimedia system, the producer might encode video frames while the
consumer decodes and renders them on the screen. The purpose of the buffer is
to reduce jitter in the video stream caused by data-dependent differences in the
encoding and decoding times for individual frames. The buffer provides a reservoir
of slots to the producer and a reservoir of encoded frames to the consumer.
Another common example is the design of graphical user interfaces. The producer
detects mouse and keyboard events and inserts them in the buffer. The consumer
removes the events from the buffer in some priority-based manner and paints the
screen.

In this section, we will develop a simple package, called Sbuf, for building
producer-consumer programs. In the next section, we look at how to use it to
build an interesting concurrent server based on prethreading. Sbuf manipulates
bounded buffers of type sbuf_t (Figure 12.24). Items are stored in a dynamically
allocated integer array (buf) with n items. The front and rear indices keep
track of the first and last items in the array. Three semaphores synchronize access
to the buffer. The mutex semaphore provides mutually exclusive buffer access.
Semaphores slots and items are counting semaphores that count the number of
empty slots and available items, respectively.

code/conc/sbuf.h
typedef struct {
    int *buf;     /* Buffer array */
    int n;        /* Maximum number of slots */
    int front;    /* buf[(front+1)%n] is first item */
    int rear;     /* buf[rear%n] is last item */
    sem_t mutex;  /* Protects accesses to buf */
    sem_t slots;  /* Counts available slots */
    sem_t items;  /* Counts available items */
} sbuf_t;
code/conc/sbuf.h

Figure 12.24 sbuf_t: Bounded buffer used by the Sbuf package.


Figure 12.25 shows the implementation of the Sbuf package. The sbuf_init
function allocates heap memory for the buffer, sets front and rear to indicate
an empty buffer, and assigns initial values to the three semaphores. This function
is called once, before calls to any of the other three functions. The sbuf_deinit
function frees the buffer storage when the application is through using it. The
sbuf_insert function waits for an available slot, locks the mutex, adds the item,
unlocks the mutex, and then announces the availability of a new item. The
sbuf_remove function is symmetric. After waiting for an available buffer item, it locks
the mutex, removes the item from the front of the buffer, unlocks the mutex, and
then signals the availability of a new slot.
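
To see the four functions cooperate, here is a minimal sketch of the same design written against the POSIX semaphore calls directly (sem_wait for P, sem_post for V) so that it compiles without csapp.h. The int return value of sbuf_init, the producer_arg_t helper, and the sbuf_demo driver are assumptions added for this illustration; they are not part of the package as described.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

typedef struct {
    int *buf;                   /* Buffer array */
    int n;                      /* Maximum number of slots */
    int front, rear;            /* First item is buf[(front+1)%n], last is buf[rear%n] */
    sem_t mutex, slots, items;  /* Lock, free-slot count, available-item count */
} sbuf_t;

/* Create an empty buffer with n slots; returns 0 on success, -1 on failure */
int sbuf_init(sbuf_t *sp, int n)
{
    if ((sp->buf = calloc(n, sizeof(int))) == NULL)
        return -1;
    sp->n = n;
    sp->front = sp->rear = 0;       /* Empty buffer iff front == rear */
    sem_init(&sp->mutex, 0, 1);     /* Binary semaphore for locking */
    sem_init(&sp->slots, 0, n);     /* Initially n empty slots */
    sem_init(&sp->items, 0, 0);     /* Initially zero items */
    return 0;
}

void sbuf_deinit(sbuf_t *sp)
{
    free(sp->buf);
}

void sbuf_insert(sbuf_t *sp, int item)
{
    sem_wait(&sp->slots);                       /* Wait for available slot */
    sem_wait(&sp->mutex);                       /* Lock the buffer */
    sp->buf[(++sp->rear) % (sp->n)] = item;     /* Insert the item */
    sem_post(&sp->mutex);                       /* Unlock the buffer */
    sem_post(&sp->items);                       /* Announce available item */
}

int sbuf_remove(sbuf_t *sp)
{
    int item;

    sem_wait(&sp->items);                       /* Wait for available item */
    sem_wait(&sp->mutex);                       /* Lock the buffer */
    item = sp->buf[(++sp->front) % (sp->n)];    /* Remove the item */
    sem_post(&sp->mutex);                       /* Unlock the buffer */
    sem_post(&sp->slots);                       /* Announce available slot */
    return item;
}

/* Hypothetical driver: one producer thread, the caller acts as consumer */
typedef struct { sbuf_t *sp; int count; } producer_arg_t;

static void *producer(void *vargp)
{
    producer_arg_t *arg = vargp;

    for (int i = 1; i <= arg->count; i++)
        sbuf_insert(arg->sp, i);
    return NULL;
}

/* Push 1..count through a 4-slot buffer; return the sum of the items,
   or -1 if setup failed or the items arrived out of FIFO order */
long sbuf_demo(int count)
{
    sbuf_t sb;
    pthread_t tid;
    producer_arg_t arg = { &sb, count };
    long sum = 0;
    int in_order = 1;

    if (sbuf_init(&sb, 4) != 0)
        return -1;
    if (pthread_create(&tid, NULL, producer, &arg) != 0) {
        sbuf_deinit(&sb);
        return -1;
    }
    for (int i = 1; i <= count; i++) {
        int item = sbuf_remove(&sb);
        if (item != i)
            in_order = 0;
        sum += item;
    }
    pthread_join(tid, NULL);
    sbuf_deinit(&sb);
    return in_order ? sum : -1;
}
```

Because the buffer in sbuf_demo has only four slots, the producer repeatedly blocks on slots and the consumer on items, which exercises the scheduling role of the two counting semaphores as well as the locking role of mutex.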


Practice Problem

Let p denote the number of producers, c the number of consumers, and n the
buffer size in units of items. For each of the following scenarios, indicate whether
the mutex semaphore in sbuf_insert and sbuf_remove is necessary or not.

A. p = 1, c = 1, n > 1

B. p = 1, c = 1, n = 1

C. p > 1, c > 1, n = 1




Readers-Writers Problem 


The readers-writers problem is a generalization of the mutual exclusion problem. 
A collection of concurrent threads is accessing a shared object such as a data 
structure in main memory or a database on disk. Some threads only read the 
object, while others modify it. Threads that modify the object are called writers.
Threads that only read it are called readers. Writers must have exclusive access to
the object, but readers may share the object with an unlimited number of other
readers. In general, there are an unbounded number of concurrent readers and 
writers.

code/conc/sbuf.c
#include "csapp.h"
#include "sbuf.h"

/* Create an empty, bounded, shared FIFO buffer with n slots */
void sbuf_init(sbuf_t *sp, int n)
{
    sp->buf = Calloc(n, sizeof(int));
    sp->n = n;                       /* Buffer holds max of n items */
    sp->front = sp->rear = 0;        /* Empty buffer iff front == rear */
    Sem_init(&sp->mutex, 0, 1);      /* Binary semaphore for locking */
    Sem_init(&sp->slots, 0, n);      /* Initially, buf has n empty slots */
    Sem_init(&sp->items, 0, 0);      /* Initially, buf has zero data items */
}

/* Clean up buffer sp */
void sbuf_deinit(sbuf_t *sp)
{
    Free(sp->buf);
}

/* Insert item onto the rear of shared buffer sp */
void sbuf_insert(sbuf_t *sp, int item)
{
    P(&sp->slots);                          /* Wait for available slot */
    P(&sp->mutex);                          /* Lock the buffer */
    sp->buf[(++sp->rear)%(sp->n)] = item;   /* Insert the item */
    V(&sp->mutex);                          /* Unlock the buffer */
    V(&sp->items);                          /* Announce available item */
}

/* Remove and return the first item from buffer sp */
int sbuf_remove(sbuf_t *sp)
{
    int item;

    P(&sp->items);                          /* Wait for available item */
    P(&sp->mutex);                          /* Lock the buffer */
    item = sp->buf[(++sp->front)%(sp->n)];  /* Remove the item */
    V(&sp->mutex);                          /* Unlock the buffer */
    V(&sp->slots);                          /* Announce available slot */
    return item;
}
code/conc/sbuf.c

Figure 12.25 Sbuf: A package for synchronizing concurrent access to bounded buffers.

Readers-writers interactions occur frequently in real systems. For example,
in an online airline reservation system, an unlimited number of customers are
allowed to concurrently inspect the seat assignments, but a customer who is booking
a seat must have exclusive access to the database. As another example, in a
multithreaded caching Web proxy, an unlimited number of threads can fetch existing
pages from the shared page cache, but any thread that writes a new page to the
cache must have exclusive access.

The readers-writers problem has several variations, each based on the priorities
of readers and writers. The first readers-writers problem, which favors readers,
requires that no reader be kept waiting unless a writer has already been granted
permission to use the object. In other words, no reader should wait simply because
a writer is waiting. The second readers-writers problem, which favors writers,
requires that once a writer is ready to write, it performs its write as soon as possible.
Unlike the first problem, a reader that arrives after a writer must wait, even if the
writer is also waiting.

Figure 12.26 shows a solution to the first readers-writers problem. Like the
solutions to many synchronization problems, it is subtle and deceptively simple.
The w semaphore controls access to the critical sections that access the shared
object. The mutex semaphore protects access to the shared readcnt variable,
which counts the number of readers currently in the critical section. A writer locks
the w mutex each time it enters the critical section and unlocks it each time it leaves.
This guarantees that there is at most one writer in the critical section at any point
in time. On the other hand, only the first reader to enter the critical section locks
w, and only the last reader to leave the critical section unlocks it. The w mutex
is ignored by readers who enter and leave while other readers are present. This
means that as long as a single reader holds the w mutex, an unbounded number of
readers can enter the critical section unimpeded.
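
The protocol just described can be exercised with a small self-checking sketch. It follows the reader and writer routines of Figure 12.26 but uses bounded loops and the POSIX calls sem_wait and sem_post in place of P and V. The shared_value counter, the writer_active flag, the MAXRW limit, and the run_readers_writers driver are hypothetical additions whose only purpose is to detect a violation of the exclusion rules.

```c
#include <pthread.h>
#include <semaphore.h>

#define MAXRW 16                    /* Hypothetical cap on threads per role */

static int readcnt;                 /* Readers currently in the critical section */
static sem_t mutex, w;              /* Both initially = 1 */
static long shared_value;           /* Stand-in for the shared object */
static volatile int writer_active;  /* Nonzero only while a writer holds w */
static long iters;                  /* Iterations per thread */

/* Reader: returns non-NULL if it ever observed a writer inside */
static void *reader(void *vargp)
{
    long bad = 0;

    (void)vargp;
    for (long i = 0; i < iters; i++) {
        sem_wait(&mutex);
        readcnt++;
        if (readcnt == 1)           /* First in */
            sem_wait(&w);
        sem_post(&mutex);

        if (writer_active)          /* Reading happens: no writer may be inside */
            bad = 1;

        sem_wait(&mutex);
        readcnt--;
        if (readcnt == 0)           /* Last out */
            sem_post(&w);
        sem_post(&mutex);
    }
    return (void *)bad;
}

/* Writer: returns non-NULL if it ever found another writer inside */
static void *writer(void *vargp)
{
    long bad = 0;

    (void)vargp;
    for (long i = 0; i < iters; i++) {
        sem_wait(&w);
        if (writer_active)
            bad = 1;
        writer_active = 1;
        shared_value++;             /* Writing happens */
        writer_active = 0;
        sem_post(&w);
    }
    return (void *)bad;
}

/* Returns 1 if no violation was seen and every write was applied,
   0 if a check failed, -1 on bad arguments */
int run_readers_writers(int nreaders, int nwriters, long niters)
{
    pthread_t tid[2 * MAXRW];
    int nthreads = 0, ok = 1;
    void *ret;

    if (nreaders < 0 || nwriters < 0 || nreaders > MAXRW || nwriters > MAXRW)
        return -1;
    readcnt = 0;
    shared_value = 0;
    writer_active = 0;
    iters = niters;
    sem_init(&mutex, 0, 1);
    sem_init(&w, 0, 1);

    for (int i = 0; i < nreaders + nwriters; i++) {
        void *(*routine)(void *) = (i < nreaders) ? reader : writer;
        if (pthread_create(&tid[nthreads], NULL, routine, NULL) != 0) {
            ok = 0;
            break;
        }
        nthreads++;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], &ret);
        if (ret != NULL)
            ok = 0;
    }
    if (shared_value != (long)nwriters * niters)
        ok = 0;
    return ok;
}
```

The sketch terminates because every thread runs a fixed number of iterations. With readers that never stop arriving, the same code would let a writer wait indefinitely on w.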

A correct solution to either of the readers-writers problems can result in
starvation, where a thread blocks indefinitely and fails to make progress. For
example, in the solution in Figure 12.26, a writer could wait indefinitely while
a stream of readers arrived.





The solution to the first readers-writers problem in Figure 12.26 gives priority to
readers, but this priority is weak in the sense that a writer leaving its critical section
might restart a waiting writer instead of a waiting reader. Describe a scenario
where this weak priority would allow a collection of writers to starve a reader.





12.5.5 Putting It Together: A Concurrent Server Based on Prethreading


We have seen how semaphores can be used to access shared variables and to 
schedule accesses to shared resources. To help you understand these ideas more 
clearly, let us apply them to a concurrent server based on a technique called
prethreading. 










/* Global variables */
int readcnt;    /* Initially = 0 */
sem_t mutex, w; /* Both initially = 1 */

void reader(void)
{
    while (1) {
        P(&mutex);
        readcnt++;
        if (readcnt == 1) /* First in */
            P(&w);
        V(&mutex);

        /* Critical section */
        /* Reading happens */

        P(&mutex);
        readcnt--;
        if (readcnt == 0) /* Last out */
            V(&w);
        V(&mutex);
    }
}

void writer(void)
{
    while (1) {
        P(&w);

        /* Critical section */
        /* Writing happens */

        V(&w);
    }
}

Figure 12.26 Solution to the first readers-writers problem. Favors readers over
writers.

In the concurrent server in Figure 12.14, we created a new thread for each
new client. A disadvantage of this approach is that we incur the nontrivial cost
of creating a new thread for each new client. A server based on prethreading
tries to reduce this overhead by using the producer-consumer model shown in
Figure 12.27. The server consists of a main thread and a set of worker threads.
The main thread repeatedly accepts connection requests from clients and places
the resulting connected descriptors in a bounded buffer. Each worker thread
repeatedly removes a descriptor from the buffer, services the client, and then waits
for the next descriptor.

Aside Other synchronization mechanisms

We have shown you how to synchronize threads using semaphores, mainly because they are simple,
classical, and have a clean semantic model. But you should know that other synchronization techniques
exist as well. For example, Java threads are synchronized with a mechanism called a Java monitor [48],
which provides a higher-level abstraction of the mutual exclusion and scheduling capabilities of
semaphores; in fact, monitors can be implemented with semaphores. As another example, the Pthreads
interface defines a set of synchronization operations on mutex and condition variables. Pthreads
mutexes are used for mutual exclusion. Condition variables are used for scheduling accesses to shared
resources, such as the bounded buffer in a producer-consumer program.

Figure 12.27 Organization of a prethreaded concurrent server. A set of existing
threads repeatedly remove and process connected descriptors from a bounded buffer.

Figure 12.28 shows how we would use the Sbuf package to implement a
prethreaded concurrent echo server. After initializing buffer sbuf (line 24), the
main thread creates the set of worker threads (lines 25-26). Then it enters the
infinite server loop, accepting connection requests and inserting the resulting
connected descriptors in sbuf. Each worker thread has a very simple behavior.
It waits until it is able to remove a connected descriptor from the buffer (line 39)
and then calls the echo_cnt function to echo client input.

The echo_cnt function in Figure 12.29 is a version of the echo function
from Figure 11.22 that records the cumulative number of bytes received from
all clients in a global variable called byte_cnt. This is interesting code to study
because it shows you a general technique for initializing packages that are called
from thread routines. In our case, we need to initialize the byte_cnt counter
and the mutex semaphore. One approach, which we used for the Sbuf and Rio
packages, is to require the main thread to explicitly call an initialization function.
Another approach, shown here, uses the pthread_once function (line 19) to call
the initialization function the first time some thread calls the echo_cnt function.
The advantage of this approach is that it makes the package easier to use. The
disadvantage is that every call to echo_cnt makes a call to pthread_once, which
most times does nothing useful.

code/conc/echoservert-pre.c
 1  #include "csapp.h"
 2  #include "sbuf.h"
 3  #define NTHREADS  4
 4  #define SBUFSIZE  16
 5
 6  void echo_cnt(int connfd);
 7  void *thread(void *vargp);
 8
 9  sbuf_t sbuf; /* Shared buffer of connected descriptors */
10
11  int main(int argc, char **argv)
12  {
13      int i, listenfd, connfd;
14      socklen_t clientlen;
15      struct sockaddr_storage clientaddr;
16      pthread_t tid;
17
18      if (argc != 2) {
19          fprintf(stderr, "usage: %s <port>\n", argv[0]);
20          exit(0);
21      }
22      listenfd = Open_listenfd(argv[1]);
23
24      sbuf_init(&sbuf, SBUFSIZE);
25      for (i = 0; i < NTHREADS; i++)  /* Create worker threads */
26          Pthread_create(&tid, NULL, thread, NULL);
27
28      while (1) {
29          clientlen = sizeof(struct sockaddr_storage);
30          connfd = Accept(listenfd, (SA *) &clientaddr, &clientlen);
31          sbuf_insert(&sbuf, connfd); /* Insert connfd in buffer */
32      }
33  }
34
35  void *thread(void *vargp)
36  {
37      Pthread_detach(pthread_self());
38      while (1) {
39          int connfd = sbuf_remove(&sbuf); /* Remove connfd from buffer */
40          echo_cnt(connfd);                /* Service client */
41          Close(connfd);
42      }
43  }
code/conc/echoservert-pre.c

Figure 12.28 A prethreaded concurrent echo server. The server uses a producer-consumer model with
one producer and multiple consumers.

code/conc/echo-cnt.c
 1  #include "csapp.h"
 2
 3  static int byte_cnt;  /* Byte counter */
 4  static sem_t mutex;   /* and the mutex that protects it */
 5
 6  static void init_echo_cnt(void)
 7  {
 8      Sem_init(&mutex, 0, 1);
 9      byte_cnt = 0;
10  }
11
12  void echo_cnt(int connfd)
13  {
14      int n;
15      char buf[MAXLINE];
16      rio_t rio;
17      static pthread_once_t once = PTHREAD_ONCE_INIT;
18
19      Pthread_once(&once, init_echo_cnt);
20      Rio_readinitb(&rio, connfd);
21      while((n = Rio_readlineb(&rio, buf, MAXLINE)) != 0) {
22          P(&mutex);
23          byte_cnt += n;
24          printf("server received %d (%d total) bytes on fd %d\n",
25                 n, byte_cnt, connfd);
26          V(&mutex);
27          Rio_writen(connfd, buf, n);
28      }
29  }
code/conc/echo-cnt.c

Figure 12.29 echo_cnt: A version of echo that counts all bytes received from
clients.
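
The once-only behavior can be observed in isolation with a short sketch. It uses the standard pthread_once call directly rather than the book's Pthread_once wrapper, and the names init_package, init_calls, worker, and count_init_calls are hypothetical, chosen only for this illustration.

```c
#include <pthread.h>

#define MAXWORKERS 8    /* Hypothetical cap for this sketch */

static pthread_once_t once = PTHREAD_ONCE_INIT;
static int init_calls;  /* How many times the initializer actually ran */

/* Package initializer: should run exactly once per process */
static void init_package(void)
{
    init_calls++;
}

/* Every worker asks for initialization; only the first request runs it */
static void *worker(void *vargp)
{
    (void)vargp;
    pthread_once(&once, init_package);
    return NULL;
}

/* Start nthreads workers, wait for them, and report how many times
   init_package has run so far (or -1 on bad arguments) */
int count_init_calls(int nthreads)
{
    pthread_t tid[MAXWORKERS];
    int created = 0;

    if (nthreads < 1 || nthreads > MAXWORKERS)
        return -1;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[created], NULL, worker, NULL) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    return init_calls;
}
```

However many workers are started, and however many times count_init_calls is invoked, the initializer runs once. Every later pthread_once call is the cheap check that the text describes as doing nothing useful most of the time.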

Once the package is initialized, the echo_cnt function initializes the Rio
buffered I/O package (line 20) and then echoes each text line that is received from
the client. Notice that the accesses to the shared byte_cnt variable in lines 23-25
are protected by P and V operations.

12.6 Using Threads for Parallelism


Thus far in our study of concurrency, we have assumed concurrent threads exe- 
cuting on uniprocessor systems. However, most modern machines have multi-core 
processors. Concurrent programs often run faster on such machines because the 
operating system kernel schedules the concurrent threads in parallel on multi- 
ple cores, rather than sequentially on a single core. Exploiting such parallelism 
is critically important in applications such as busy Web servers, database servers, 
and large scientific codes, and it is becoming increasingly useful in mainstream
applications such as Web browsers, spreadsheets, and document processors.

Figure 12.30 shows the set relationships between sequential, concurrent, and
parallel programs. The set of all programs can be partitioned into the disjoint
sets of sequential and concurrent programs. A sequential program is written as a
single logical flow. A concurrent program is written as multiple concurrent flows.
A parallel program is a concurrent program running on multiple processors. Thus,
the set of parallel programs is a proper subset of the set of concurrent programs.

A detailed treatment of parallel programs is beyond our scope, but studying
a few simple example programs will help you understand some important aspects
of parallel programming. For example, consider how we might sum the sequence
of integers 0, ..., n - 1 in parallel. Of course, there is a closed-form solution for
this particular problem, but nonetheless it is a concise and easy-to-understand
exemplar that will allow us to make some interesting points about parallel programs.

The most straightforward approach for assigning work to different threads is
to partition the sequence into t disjoint regions and then assign each of t different
threads to work on its own region. For simplicity, assume that n is a multiple of t,
such that each region has n/t elements. Let's look at some of the different ways
that multiple threads might work on their assigned regions in parallel.

The simplest and most straightforward option is to have the threads sum into 
a shared global variable that is protected by a mutex. Figure 12.31 shows how we 
might implement this. In lines 28-33, the main thread creates the peer threads 
and then waits for them to terminate. Notice that the main thread passes a small 
integer to each peer thread that serves as a unique thread ID. Each peer thread
will use its thread ID to determine which portion of the sequence it should work 
on. This idea of passing a small unique thread ID to the peer threads is a general 
technique that is used in many parallel applications. After the peer threads have 
terminated, the global variable gsum contains the final sum. The main thread then 
uses the closed-form solution to verify the result (lines 36-37). 

Figure 12.32 shows the function that each peer thread executes. In line 4, the 
thread extracts the thread ID from the thread argument and then uses this ID to
determine the region of the sequence it should work on (lines 5-6). In lines 9-13, 
the thread iterates over its portion of the sequence, updating the shared global 
variable gsum on each iteration. Notice that we are careful to protect each update 
with P and V mutex operations. 

When we run psum-mutex on a system with four cores on a sequence of size
n = 2^31 and measure its running time (in seconds) as a function of the number of
threads, we get a nasty surprise:

                    Number of threads
Version          1      2      4      8     16
psum-mutex      68    432    719    552    599

Not only is the program extremely slow when it runs sequentially as a single
thread, it is nearly an order of magnitude slower when it runs in parallel as
multiple threads. And the performance gets worse as we add more cores. The
reason for this poor performance is that the synchronization operations (P and V)
are very expensive relative to the cost of a single memory update. This highlights
an important lesson about parallel programming: Synchronization overhead is
expensive and should be avoided if possible. If it cannot be avoided, the overhead
should be amortized by as much useful computation as possible.
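
One way to picture amortization is a variant in which each thread still updates the shared sum under the mutex, but only once, after accumulating its whole region in a local variable. This variant is an assumption made for illustration and is not one of the programs measured in this section. It calls the POSIX functions sem_wait and sem_post in place of P and V, and the names sum_amortized and psum_amortized are hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>

#define MAXTHREADS 32

static long gsum;               /* Global sum */
static long nelems_per_thread;  /* Number of elements each thread sums */
static sem_t mutex;             /* Mutex to protect global sum */

/* Thread routine: sum a private region, then publish it with one P/V pair */
static void *sum_amortized(void *vargp)
{
    long myid = *((long *)vargp);           /* Extract the thread ID */
    long start = myid * nelems_per_thread;  /* Start element index */
    long end = start + nelems_per_thread;   /* End element index */
    long sum = 0;

    for (long i = start; i < end; i++)
        sum += i;                           /* No synchronization needed here */

    sem_wait(&mutex);                       /* One P per thread, not per element */
    gsum += sum;
    sem_post(&mutex);
    return NULL;
}

/* Sum 0..2^log_nelems - 1 using nthreads threads; returns the sum, or -1 on
   error. Assumes nthreads evenly divides the number of elements. */
long psum_amortized(long nthreads, long log_nelems)
{
    long myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];
    long created = 0;

    if (nthreads < 1 || nthreads > MAXTHREADS)
        return -1;
    gsum = 0;
    nelems_per_thread = (1L << log_nelems) / nthreads;
    sem_init(&mutex, 0, 1);

    for (long i = 0; i < nthreads; i++) {
        myid[i] = i;
        if (pthread_create(&tid[i], NULL, sum_amortized, &myid[i]) != 0)
            break;
        created++;
    }
    for (long i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    return (created == nthreads) ? gsum : -1;
}
```

With t threads this performs t pairs of semaphore operations in total instead of one pair per element, so the synchronization cost no longer grows with n.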

One way to avoid synchronization in our example program is to have each
peer thread compute its partial sum in a private variable that is not shared with
any other thread, as shown in Figure 12.33. The main thread (not shown) defines
a global array called psum, and each peer thread i accumulates its partial sum in
psum[i]. Since we are careful to give each peer thread a unique memory location
to update, it is not necessary to protect these updates with mutexes. The only
necessary synchronization is that the main thread must wait for all of the children
to finish. After the peer threads have terminated, the main thread sums up the
elements of the psum vector to arrive at the final result.

code/conc/psum-mutex.c
 1  #include "csapp.h"
 2  #define MAXTHREADS 32
 3
 4  void *sum_mutex(void *vargp); /* Thread routine */
 5
 6  /* Global shared variables */
 7  long gsum = 0;           /* Global sum */
 8  long nelems_per_thread;  /* Number of elements to sum */
 9  sem_t mutex;             /* Mutex to protect global sum */
10
11  int main(int argc, char **argv)
12  {
13      long i, nelems, log_nelems, nthreads, myid[MAXTHREADS];
14      pthread_t tid[MAXTHREADS];
15
16      /* Get input arguments */
17      if (argc != 3) {
18          printf("Usage: %s <nthreads> <log_nelems>\n", argv[0]);
19          exit(0);
20      }
21      nthreads = atoi(argv[1]);
22      log_nelems = atoi(argv[2]);
23      nelems = (1L << log_nelems);
24      nelems_per_thread = nelems / nthreads;
25      sem_init(&mutex, 0, 1);
26
27      /* Create peer threads and wait for them to finish */
28      for (i = 0; i < nthreads; i++) {
29          myid[i] = i;
30          Pthread_create(&tid[i], NULL, sum_mutex, &myid[i]);
31      }
32      for (i = 0; i < nthreads; i++)
33          Pthread_join(tid[i], NULL);
34
35      /* Check final answer */
36      if (gsum != (nelems * (nelems-1))/2)
37          printf("Error: result=%ld\n", gsum);
38
39      exit(0);
40  }
code/conc/psum-mutex.c

Figure 12.31 Main routine for psum-mutex. Uses multiple threads to sum the elements
of a sequence into a shared global variable protected by a mutex.

code/conc/psum-mutex.c
 1  /* Thread routine for psum-mutex.c */
 2  void *sum_mutex(void *vargp)
 3  {
 4      long myid = *((long *)vargp);          /* Extract the thread ID */
 5      long start = myid * nelems_per_thread; /* Start element index */
 6      long end = start + nelems_per_thread;  /* End element index */
 7      long i;
 8
 9      for (i = start; i < end; i++) {
10          P(&mutex);
11          gsum += i;
12          V(&mutex);
13      }
14      return NULL;
15  }
code/conc/psum-mutex.c

Figure 12.32 Thread routine for psum-mutex. Each peer thread sums into a shared global variable protected
by a mutex.





code/conc/psum-array.c
/* Thread routine for psum-array.c */
void *sum_array(void *vargp)
{
    long myid = *((long *)vargp);          /* Extract the thread ID */
    long start = myid * nelems_per_thread; /* Start element index */
    long end = start + nelems_per_thread;  /* End element index */
    long i;

    for (i = start; i < end; i++) {
        psum[myid] += i;
    }
    return NULL;
}
code/conc/psum-array.c

Figure 12.33 Thread routine for psum-array. Each peer thread accumulates its partial sum in a private
array element that is not shared with any other peer thread.

When we run psum-array on our four-core system, we see that it runs orders
of magnitude faster than psum-mutex:

                       Number of threads
Version           1        2        4        8       16
psum-mutex    68.00   432.00   719.00   552.00   599.00
psum-array     7.26     3.64     1.91     1.85     1.84

In Chapter 5, we learned how to use local variables to eliminate unnecessary
memory references. Figure 12.34 shows how we can apply this principle by having
each peer thread accumulate its partial sum into a local variable rather than
a global variable. When we run psum-local on our four-core machine, we get
another order-of-magnitude decrease in running time:

                       Number of threads
Version           1        2        4        8       16
psum-mutex    68.00   432.00   719.00   552.00   599.00
psum-array     7.26     3.64     1.91     1.85     1.84
psum-local     1.06     0.54     0.28     0.29     0.30

code/conc/psum-local.c
/* Thread routine for psum-local.c */
void *sum_local(void *vargp)
{
    long myid = *((long *)vargp);          /* Extract the thread ID */
    long start = myid * nelems_per_thread; /* Start element index */
    long end = start + nelems_per_thread;  /* End element index */
    long i, sum = 0;

    for (i = start; i < end; i++) {
        sum += i;
    }
    psum[myid] = sum;
    return NULL;
}
code/conc/psum-local.c

Figure 12.34 Thread routine for psum-local. Each peer thread accumulates its partial sum in a local
variable.

Figure 12.35 Performance of psum-local (Figure 12.34). Summing a sequence of
2^31 elements using four processor cores. [Plot: elapsed time (s) versus number of threads.]

An important lesson to take away from this exercise is that writing parallel
programs is tricky. Seemingly small changes to the code have a significant impact
on performance.



Characterizing the Performance of Parallel Programs


Figure 12.35 plots the total elapsed running time of the psum-local program in
Figure 12.34 as a function of the number of threads. In each case, the program
runs on a system with four processor cores and sums a sequence of n = 2^31
elements. We see that running time decreases as we increase the number of threads,
up to four threads, at which point it levels off and even starts to increase a
little.

In the ideal case, we would expect the running time to decrease linearly with
the number of cores. That is, we would expect running time to drop by half each
time we double the number of threads. This is indeed the case until we reach
the point (t > 4) where each of the four cores is busy running at least one thread.
Running time actually increases a bit as we increase the number of threads because
of the overhead of context switching multiple threads on the same core. For this
reason, parallel programs are often written so that each core runs exactly one
thread.
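
A program that wants one thread per core needs to know how many cores are online. The sketch below shows one possible way to choose a thread count. It assumes a Linux or similar system where sysconf accepts _SC_NPROCESSORS_ONLN, a widely available extension that POSIX does not require, and pick_nthreads is a hypothetical helper name.

```c
#include <unistd.h>

/* Return the number of threads to create: the requested count, capped at
   the number of cores currently online and never less than 1 */
long pick_nthreads(long requested)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* -1 if the query fails */

    if (ncores < 1)
        ncores = 1;         /* Fall back to a single thread */
    if (requested < 1)
        requested = 1;
    return (requested < ncores) ? requested : ncores;
}
```

A main routine like the one in Figure 12.31 could pass its nthreads argument through such a helper before creating peer threads, so that asking for 16 threads on a four-core machine would create only four.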

Although absolute running time is the ultimate measure of any program's
performance, there are some useful relative measures that can provide insight
into how well a parallel program is exploiting potential parallelism. The speedup
of a parallel program is typically defined as

$$S_p = \frac{T_1}{T_p}$$

where p is the number of processor cores and T_k is the running time on k cores. This
formulation is sometimes referred to as strong scaling. When T_1 is the execution
time of a sequential version of the program, then S_p is called the absolute speedup.
When T_1 is the execution time of the parallel version of the program running on
one core, then S_p is called the relative speedup. Absolute speedup is a truer
measure of the benefits of parallelism than relative speedup. Parallel programs often
suffer from synchronization overheads, even when they run on one processor, and
these overheads can artificially inflate the relative speedup numbers because they
increase the size of the numerator. On the other hand, absolute speedup is more
difficult to measure than relative speedup because measuring absolute speedup
requires two different versions of the program. For complex parallel codes, creating
a separate sequential version might not be feasible, either because the code is
too complex or because the source code is not available.

Threads (t)             1      2      4      8     16
Cores (p)               1      2      4      4      4
Running time (T_p)   1.06   0.54   0.28   0.29   0.30
Speedup (S_p)           1    1.9    3.8    3.7    3.5
Efficiency (E_p)     100%    98%    95%    91%    88%

Figure 12.36 Speedup and parallel efficiency for the execution times in
Figure 12.35.

A related measure, known as efficiency, is defined as

$$E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p}$$

and is typically reported as a percentage in the range (0, 100]. Efficiency is a
measure of the overhead due to parallelization. Programs with high efficiency are
spending more time doing useful work and less time synchronizing and
communicating than programs with low efficiency.
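
The two definitions translate directly into code. The following sketch uses hypothetical function names, speedup and efficiency, and returns efficiency as a fraction in (0, 1] rather than as a percentage.

```c
/* Strong-scaling speedup: S_p = T_1 / T_p */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Parallel efficiency: E_p = S_p / p = T_1 / (p * T_p), as a fraction */
double efficiency(double t1, double tp, int p)
{
    return speedup(t1, tp) / p;
}
```

For the psum-local times measured on four cores (T_1 = 1.06 s, T_4 = 0.28 s), speedup(1.06, 0.28) is about 3.8 and efficiency(1.06, 0.28, 4) is about 0.95, which agrees with the 3.8 and 95% entries reported for four threads.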

Figure 12.36 shows the different speedup and efficiency measures for our
example parallel sum program. Efficiencies over 90 percent such as these are very
good, but do not be fooled. We were able to achieve high efficiency because our
problem was trivially easy to parallelize. In practice, this is not usually the case.
Parallel programming has been an active area of research for decades. With the
advent of commodity multi-core machines whose core count is doubling every few
years, parallel programming continues to be a deep, difficult, and active area of
research.

There is another view of speedup, known as weak scaling, which increases
the problem size along with the number of processors, such that the amount of
work performed on each processor is held constant as the number of processors
increases. With this formulation, speedup and efficiency are expressed in terms
of the total amount of work accomplished per unit time. For example, if we can
double the number of processors and do twice the amount of work per hour, then
we are enjoying linear speedup and 100 percent efficiency.

Weak scaling is often a truer measure than strong scaling because it more
accurately reflects our desire to use bigger machines to do more work. This is
particularly true for scientific codes, where the problem size can be easily increased
and where bigger problem sizes translate directly to better predictions of nature.
However, there exist applications whose sizes are not so easily increased, and for
these applications strong scaling is more appropriate. For example, the amount of
work performed by real-time signal-processing applications is often determined
by the properties of the physical sensors that are generating the signals. Changing
the total amount of work requires using different physical sensors, which might not
be feasible or necessary. For these applications, we typically want to use parallelism
to accomplish a fixed amount of work as quickly as possible.





Fill in the blanks for the parallel program in the following table. Assume strong
scaling.

Threads (t)             1      2      4
Cores (p)               1      2      4
Running time (T_p)     12    ___    ___
Speedup (S_p)         ___    1.5    ___
Efficiency (E_p)     100%    ___    50%





12.7 Other Concurrency Issues 


You probably noticed that life got much more complicated once we were asked
to synchronize accesses to shared data. So far, we have looked at techniques for
mutual exclusion and producer-consumer synchronization, but this is only the tip
of the iceberg. Synchronization is a fundamentally difficult problem that raises
issues that simply do not arise in ordinary sequential programs. This section is a
survey (by no means complete) of some of the issues you need to be aware of
when you write concurrent programs. To keep things concrete, we will couch our
discussion in terms of threads. Keep in mind, however, that these are typical of the
issues that arise when concurrent flows of any kind manipulate shared resources.


12.7.1 Thread Safety 


When we program with threads, we must be careful to write functions that have a
property called thread safety. A function is said to be thread-safe if and only if it will
always produce correct results when called repeatedly from multiple concurrent
threads. If a function is not thread-safe, then we say it is thread-unsafe.

We can identify four (nondisjoint) classes of thread-unsafe functions:

Class 1: Functions that do not protect shared variables. We have already
encountered this problem with the thread function in Figure 12.16, which
increments an unprotected global counter variable. This class of thread-unsafe
functions is relatively easy to make thread-safe: protect the shared
variables with synchronization operations such as P and V. An advantage
is that it does not require any changes in the calling program. A disadvantage
is that the synchronization operations slow down the function.

code/conc/rand.c
unsigned next_seed = 1;

/* rand - return pseudorandom integer in the range 0..32767 */
unsigned rand(void)
{
    next_seed = next_seed*1103515245 + 12543;
    return (unsigned)(next_seed>>16) % 32768;
}

/* srand - set the initial seed for rand() */
void srand(unsigned new_seed)
{
    next_seed = new_seed;
}
code/conc/rand.c

Figure 12.37 A thread-unsafe pseudorandom number generator. (Based on [61])


Class 2: Functions that keep state across multiple invocations. A pseudorandom 
         number generator is a simple example of this class of thread-unsafe func- 
         tions. Consider the pseudorandom number generator package in Fig- 
         ure 12.37. 
         The rand function is thread-unsafe because the result of the current 
         invocation depends on an intermediate result from the previous iteration. 
         When we call rand repeatedly from a single thread after seeding it with a 
         call to srand, we can expect a repeatable sequence of numbers. However, 
         this assumption no longer holds if multiple threads are calling rand. 
         The only way to make a function such as rand thread-safe is to rewrite 
         it so that it does not use any static data, relying instead on the caller 
         to pass the state information in arguments. The disadvantage is that the 
         programmer is now forced to change the code in the calling routine as 
         well. In a large program where there are potentially hundreds of different 
         call sites, making such modifications could be nontrivial and prone to 
         error. 


Class 3: Functions that return a pointer to a static variable. Some functions, 
         such as ctime and gethostbyname, compute a result in a static variable 
         and then return a pointer to that variable. If we call such functions from 
         concurrent threads, then disaster is likely, as results being used by one 
         thread are silently overwritten by another thread. 

code/conc/ctime-ts.c 
char *ctime_ts(const time_t *timep, char *privatep) 
{ 
    char *sharedp; 

    P(&mutex); 
    sharedp = ctime(timep); 
    strcpy(privatep, sharedp); /* Copy string from shared to private */ 
    V(&mutex); 
    return privatep; 
} 
code/conc/ctime-ts.c 

Figure 12.38 Thread-safe wrapper function for the C standard library ctime function. This example 
uses the lock-and-copy technique to call a class 3 thread-unsafe function. 

There are two ways to deal with this class of thread-unsafe func- 
tions. One option is to rewrite the function so that the caller passes the 
address of the variable in which to store the results. This eliminates all 
shared data, but it requires the programmer to have access to the function 
source code. 

If the thread-unsafe function is difficult or impossible to modify (e.g., 
the code is very complex or there is no source code available), then another 
option is to use the lock-and-copy technique. The basic idea is to 
associate a mutex with the thread-unsafe function. At each call site, lock 
the mutex, call the thread-unsafe function, copy the result returned by 
the function to a private memory location, and then unlock the mutex. 
To minimize changes to the caller, you should define a thread-safe wrap- 
per function that performs the lock-and-copy and then replace all calls 
to the thread-unsafe function with calls to the wrapper. For example, 
Figure 12.38 shows a thread-safe wrapper for ctime that uses the lock- 
and-copy technique. 


Class 4: Functions that call thread-unsafe functions. If a function f calls a thread- 
         unsafe function g, is f thread-unsafe? It depends. If g is a class 2 function 
         that relies on state across multiple invocations, then f is also thread- 
         unsafe and there is no recourse short of rewriting g. However, if g is a 
         class 1 or class 3 function, then f can still be thread-safe if you protect 
         the call site and any resulting shared data with a mutex. We see a good 
         example of this in Figure 12.38, where we use lock-and-copy to write a 
         thread-safe function that calls a thread-unsafe function. 

[Figure 12.39: Set diagram relating reentrant, thread-safe, and thread-unsafe 
functions within the set of all functions.] 

code/conc/rand-r.c 
/* rand_r - return a pseudorandom integer on 0..32767 */ 
int rand_r(unsigned int *nextp) 
{ 
    *nextp = *nextp * 1103515245 + 12345; 
    return (unsigned int)(*nextp / 65536) % 32768; 
} 
code/conc/rand-r.c 

Figure 12.40 rand_r: A reentrant version of the rand function from Figure 12.37. 
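
To make the class 1 remedy concrete, here is a minimal sketch of a shared counter 
protected by a lock. It assumes POSIX threads and uses a Pthreads mutex rather than 
the P and V wrappers from this chapter. The names counter_mutex, incr_worker, and 
run_counter_demo are hypothetical. 

```c
#include <pthread.h>

#define NTHREADS 4
#define NITERS   100000

/* Shared state: the counter and the mutex that protects it */
static long counter = 0;
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Thread routine: every increment happens with the mutex held */
static void *incr_worker(void *vargp)
{
    (void)vargp;                        /* Argument is unused */
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&counter_mutex);
        counter++;
        pthread_mutex_unlock(&counter_mutex);
    }
    return NULL;
}

/* Runs NTHREADS workers to completion. Returns the final counter value,
   or -1 if not all threads could be created. */
long run_counter_demo(void)
{
    pthread_t tid[NTHREADS];
    int created = 0;

    counter = 0;
    for (int i = 0; i < NTHREADS; i++) {
        if (pthread_create(&tid[i], NULL, incr_worker, NULL) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)   /* Reap only the threads that exist */
        pthread_join(tid[i], NULL);
    return (created == NTHREADS) ? counter : -1;
}
```

Because every increment runs with the mutex held, run_counter_demo() should return 
NTHREADS * NITERS on every run. Removing the lock and unlock calls reintroduces the 
class 1 bug, and the result may then vary from run to run. 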
LI 2 [ 
12.7.2 Reentrancy 

There is an important class of thread-safe functions, known as reentrant functions, 
that are characterized by the property that they do not reference any shared data 
when they are called by multiple threads. Although the terms thread-safe and 
reentrant are sometimes used (incorrectly) as synonyms, there is a clear technical 
distinction that is worth preserving. Figure 12.39 shows the set relationships 
between reentrant, thread-safe, and thread-unsafe functions. The set of all functions 
is partitioned into the disjoint sets of thread-safe and thread-unsafe functions. The 
set of reentrant functions is a proper subset of the thread-safe functions. 

Reentrant functions are typically more efficient than non-reentrant thread-safe 
functions because they require no synchronization operations. Furthermore, 
the only way to convert a class 2 thread-unsafe function into a thread-safe one is 
to rewrite it so that it is reentrant. For example, Figure 12.40 shows a reentrant 
version of the rand function from Figure 12.37. The key idea is that we have 
replaced the static next variable with a pointer that is passed in by the caller. 

Is it possible to inspect the code of some function and declare a priori that it is 
reentrant? Unfortunately, it depends. If all function arguments are passed by value 
(i.e., no pointers) and all data references are to local automatic stack variables (i.e., 
no references to static or global variables), then the function is explicitly reentrant, 
in the sense that we can assert its reentrancy regardless of how it is called. 

However, if we loosen our assumptions a bit and allow some parameters in 
our otherwise explicitly reentrant function to be passed by reference (i.e., we 
allow them to pass pointers), then we have an implicitly reentrant function, in 
the sense that it is only reentrant if the calling threads are careful to pass pointers 
to nonshared data. For example, the rand_r function in Figure 12.40 is implicitly 
reentrant. 

We always use the term reentrant to include both explicit and implicit reentrant 
functions. However, it is important to realize that reentrancy is sometimes 
a property of both the caller and the callee, and not just the callee alone. 
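
The caller's share of the responsibility can be illustrated with a small sketch. It 
assumes POSIX threads. The function rand_r_sketch repeats the recurrence from 
Figure 12.40 under a different, hypothetical name so that it does not clash with the 
C library's own rand_r, and struct stream, gen_worker, and streams_match are likewise 
hypothetical. Each thread passes a pointer to its own seed, which is exactly the 
obligation that implicit reentrancy places on the caller. 

```c
#include <pthread.h>
#include <string.h>

#define NVALS 5

/* Same recurrence as rand_r in Figure 12.40, renamed to avoid a clash
   with the C library's rand_r */
static int rand_r_sketch(unsigned int *nextp)
{
    *nextp = *nextp * 1103515245 + 12345;
    return (unsigned int)(*nextp / 65536) % 32768;
}

/* Per-thread record: a private seed and a private output buffer */
struct stream {
    unsigned int seed;
    int vals[NVALS];
};

/* Thread routine: touches only the record it was handed */
static void *gen_worker(void *vargp)
{
    struct stream *sp = vargp;
    for (int i = 0; i < NVALS; i++)
        sp->vals[i] = rand_r_sketch(&sp->seed);
    return NULL;
}

/* Returns 1 if two concurrent threads that start from the same seed
   produce identical sequences, 0 if they differ, and -1 if a thread
   could not be created */
int streams_match(unsigned int seed)
{
    struct stream a = { .seed = seed }, b = { .seed = seed };
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, gen_worker, &a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, gen_worker, &b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return memcmp(a.vals, b.vals, sizeof(a.vals)) == 0;
}
```

If both threads were instead handed a pointer to the same seed, the code would still 
compile, but the calls would no longer be reentrant and the two sequences would 
depend on how the threads happened to interleave. 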





The ctime_ts function in Figure 12.38 is thread-safe but not reentrant. Explain. 


12.7.3 Using Existing Library Functions in Threaded Programs 


Most Linux functions, including the functions defined in the standard C library 
(such as malloc, free, realloc, printf, and scanf), are thread-safe, with only 
a few exceptions. Figure 12.41 lists some common exceptions. (See [110] for a 
complete list.) The strtok function is a deprecated function (one whose use is 
discouraged) for parsing strings. The asctime, ctime, and localtime functions 
are popular functions for converting back and forth between different time and 
date formats. The gethostbyaddr, gethostbyname, and inet_ntoa functions 
are obsolete network programming functions that have been replaced by the 
reentrant getaddrinfo, getnameinfo, and inet_ntop functions, respectively (see 
Chapter 11). With the exceptions of rand and strtok, they are of the class 3 variety 
that return a pointer to a static variable. If we need to call one of these functions in 
a threaded program, the least disruptive approach to the caller is to lock and copy. 
However, the lock-and-copy approach has a number of disadvantages. First, the 
additional synchronization slows down the program. Second, functions that return 
pointers to complex structures of structures require a deep copy of the structures 
in order to copy the entire structure hierarchy. Third, the lock-and-copy approach 
will not work for a class 2 thread-unsafe function such as rand that relies on static 
state across calls. 


Therefore, Linux systems provide reentrant versions of most thread-unsafe 
functions. The names of the reentrant versions always end with the _r suffix. For 
example, the reentrant version of asctime is called asctime_r. We recommend 
using these functions whenever possible. 

Thread-unsafe function    Thread-unsafe class    Linux thread-safe version 
rand                      2                      rand_r 
strtok                    2                      strtok_r 
asctime                   3                      asctime_r 
ctime                     3                      ctime_r 
gethostbyaddr             3                      gethostbyaddr_r 
gethostbyname             3                      gethostbyname_r 
inet_ntoa                 3                      (none) 
localtime                 3                      localtime_r 

Figure 12.41 Common thread-unsafe library functions. 
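
As a small illustration of the _r convention, the following sketch tokenizes a string 
with strtok_r, the reentrant counterpart of strtok. It assumes a POSIX system, and the 
helper name split_words is hypothetical. The parsing position that strtok keeps in 
hidden static state is held instead in a local variable supplied by the caller. 

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* Splits buf in place on spaces and stores up to max token pointers in
   toks. Returns the number of tokens stored. The parsing position lives
   in the local variable saveptr, not in static storage. */
int split_words(char *buf, char *toks[], int max)
{
    char *saveptr;                      /* Per-call state for strtok_r */
    int n = 0;

    for (char *t = strtok_r(buf, " ", &saveptr);
         t != NULL && n < max;
         t = strtok_r(NULL, " ", &saveptr))
        toks[n++] = t;
    return n;
}
```

Calling split_words on a writable buffer that contains "lock and copy" yields three 
tokens. Since each call has its own saveptr, concurrent calls on different buffers do 
not interfere with one another. 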


12.7.4 Races 


A race occurs when the correctness of a program depends on one thread reaching 
point x in its control flow before another thread reaches point y. Races usually 
occur because programmers assume that threads will take some particular trajec- 
tory through the execution state space, forgetting the golden rule that threaded 
programs must work correctly for any feasible trajectory. 

An example is the easiest way to understand the nature of races. Consider the 
simple program in Figure 12.42. The main thread creates four peer threads and 
passes a pointer to a unique integer ID to each one. Each peer thread copies the 
ID passed in its argument to a local variable (line 22) and then prints a message 
containing the ID. It looks simple enough, but when we run this program on our 
system, we get the following incorrect result: 

linux> ./race 
Hello from thread 1 
Hello from thread 3 
Hello from thread 2 
Hello from thread 3 

code/conc/race.c 
 1  /* WARNING: This code is buggy! */ 
 2  #include "csapp.h" 
 3  #define N 4 
 4 
 5  void *thread(void *vargp); 
 6 
 7  int main() 
 8  { 
 9      pthread_t tid[N]; 
10      int i; 
11 
12      for (i = 0; i < N; i++) 
13          Pthread_create(&tid[i], NULL, thread, &i); 
14      for (i = 0; i < N; i++) 
15          Pthread_join(tid[i], NULL); 
16      exit(0); 
17  } 
18 
19  /* Thread routine */ 
20  void *thread(void *vargp) 
21  { 
22      int myid = *((int *)vargp); 
23      printf("Hello from thread %d\n", myid); 
24      return NULL; 
25  } 
code/conc/race.c 

Figure 12.42 A program with a race. 


The problem is caused by a race between each peer thread and the main 
thread. Can you spot the race? Here is what happens. When the main thread 
creates a peer thread in line 13, it passes a pointer to the local stack variable 
i. At this point, the race is on between the next increment of i in line 12 and 
the dereferencing and assignment of the argument in line 22. If the peer thread 
executes line 22 before the main thread increments i in line 12, then the myid 
variable gets the correct ID. Otherwise, it will contain the ID of some other thread. 
The scary thing is that whether we get the correct answer depends on how the 
kernel schedules the execution of the threads. On our system it fails, but on other 
systems it might work correctly, leaving the programmer blissfully unaware of a 
serious bug. 

To eliminate the race, we can dynamically allocate a separate block for each 
integer ID and pass the thread routine a pointer to this block, as shown in 
Figure 12.43 (lines 12-14). Notice that the thread routine must free the block in order 
to avoid a memory leak. 

When we run this program on our system, we now get the correct result: 


linux> ./norace 
Hello from thread 0 
Hello from thread 1 
Hello from thread 2 
Hello from thread 3 





In Figure 12.43, we might be tempted to free the allocated memory block 
immediately after line 14 in the main thread, instead of freeing it in the peer thread. 
But this would be a bad idea. Why? 





A. In Figure 12.43, we eliminated the race by allocating a separate block for 
   each integer ID. Outline a different approach that does not call the malloc 
   or free functions. 

B. What are the advantages and disadvantages of this approach? 

code/conc/norace.c 
 1  #include "csapp.h" 
 2  #define N 4 
 3 
 4  void *thread(void *vargp); 
 5 
 6  int main() 
 7  { 
 8      pthread_t tid[N]; 
 9      int i, *ptr; 
10 
11      for (i = 0; i < N; i++) { 
12          ptr = Malloc(sizeof(int)); 
13          *ptr = i; 
14          Pthread_create(&tid[i], NULL, thread, ptr); 
15      } 
16      for (i = 0; i < N; i++) 
17          Pthread_join(tid[i], NULL); 
18      exit(0); 
19  } 
20 
21  /* Thread routine */ 
22  void *thread(void *vargp) 
23  { 
24      int myid = *((int *)vargp); 
25      Free(vargp); 
26      printf("Hello from thread %d\n", myid); 
27      return NULL; 
28  } 
code/conc/norace.c 

Figure 12.43 A correct version of the program in Figure 12.42 without a race. 


12.7.5 Deadlocks 


Semaphores introduce the potential for a nasty kind of run-time error, called 
deadlock, where a collection of threads is blocked, waiting for a condition that 
will never be true. The progress graph is an invaluable tool for understanding 
deadlock. For example, Figure 12.44 shows the progress graph for a pair of threads 
that use two semaphores for mutual exclusion. From this graph, we can glean some 
important insights about deadlock: 

* The programmer has incorrectly ordered the P and V operations such that 
  the forbidden regions for the two semaphores overlap. If some execution 
  trajectory happens to reach the deadlock state d, then no further progress is 
  possible because the overlapping forbidden regions block progress in every 
  legal direction. In other words, the program is deadlocked because each 
  thread is waiting for the other to do a V operation that will never occur. 

* The overlapping forbidden regions induce a set of states called the deadlock 
  region. If a trajectory happens to touch a state in the deadlock region, then 
  deadlock is inevitable. Trajectories can enter deadlock regions, but they can 
  never leave. 

* Deadlock is an especially difficult issue because it is not always predictable. 
  Some lucky execution trajectories will skirt the deadlock region, while others 
  will be trapped by it. Figure 12.44 shows an example of each. The implications 
  for a programmer are scary. You might run the same program a thousand times 
  without any problem, but then the next time it deadlocks. Or the program 
  might work fine on one machine but deadlock on another. Worst of all, 
  the error is often not repeatable because different executions have different 
  trajectories. 

[Figure 12.44: Progress graph with Thread 1 on the horizontal axis and Thread 2 on 
the vertical axis, marking the deadlock region, the deadlock state d, a trajectory 
that deadlocks, and a trajectory that does not deadlock.] 

Figure 12.44 Progress graph for a program that can deadlock. 


Programs deadlock for many reasons, and preventing them is a difficult problem 
in general. However, when binary semaphores are used for mutual exclusion, 
as in Figure 12.44, then you can apply the following simple and effective rule to 
prevent deadlocks: 

Mutex lock ordering rule: Given a total ordering of all mutexes, a program is 
deadlock-free if each thread acquires its mutexes in order and releases them in 
reverse order. 

[Figure 12.45: Progress graph with Thread 1 on the horizontal axis, performing 
P(s), P(t), V(s), V(t), and Thread 2 on the vertical axis. Initially, s = 1 and t = 1.] 

Figure 12.45 Progress graph for a deadlock-free program. 

For example, we can fix the deadlock in Figure 12.44 by locking s first, then t, 
in each thread. Figure 12.45 shows the resulting progress graph. 

Initially: s = 1, t = 0. 

Thread 1:    Thread 2: 
V(t);        V(t); 

A. Draw the progress graph for this program. 

B. Does it always deadlock? 

C. If so, what simple change to the initial semaphore values will eliminate the 
   potential for deadlock? 

D. Draw the progress graph for the resulting deadlock-free program. 
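
The mutex lock ordering rule from this section can be sketched with two Pthreads 
mutexes. This example assumes POSIX threads and uses pthread_mutex_t in place of the 
binary semaphores s and t from the text. The names mutex_s, mutex_t, ordered_worker, 
and run_ordered_demo are hypothetical. Both threads acquire mutex_s before mutex_t 
and release them in reverse order. 

```c
#include <pthread.h>

#define NITERS 100000

/* Two mutexes with the fixed total order mutex_s < mutex_t */
static pthread_mutex_t mutex_s = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mutex_t = PTHREAD_MUTEX_INITIALIZER;
static long shared_a = 0;               /* Updated while both locks are held */
static long shared_b = 0;               /* Updated while both locks are held */

/* Both threads run this routine, so both lock mutex_s first and
   mutex_t second, then unlock in the reverse order */
static void *ordered_worker(void *vargp)
{
    (void)vargp;                        /* Argument is unused */
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&mutex_s);
        pthread_mutex_lock(&mutex_t);
        shared_a++;
        shared_b++;
        pthread_mutex_unlock(&mutex_t);
        pthread_mutex_unlock(&mutex_s);
    }
    return NULL;
}

/* Runs two workers to completion. Returns shared_a + shared_b, or -1
   if a thread could not be created. */
long run_ordered_demo(void)
{
    pthread_t t1, t2;

    shared_a = shared_b = 0;
    if (pthread_create(&t1, NULL, ordered_worker, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, ordered_worker, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_a + shared_b;
}
```

If one thread instead ran a routine that locked mutex_t before mutex_s, the ordering 
would be violated and deadlock would become possible, though not certain, on any 
given run. 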


12.8 Summary 


A concurrent program consists of a collection of logical flows that overlap in time. 
In this chapter, we have studied three different mechanisms for building concur- 
rent programs: processes, I/O multiplexing, and threads. We used a concurrent 
network server as the motivating application throughout. 

Processes are scheduled automatically by the kernel, and because of their 
separate virtual address spaces, they require explicit IPC mechanisms in order 
to share data. Event-driven programs create their own concurrent logical flows, 
which are modeled as state machines, and use I/O multiplexing to explicitly sched- 
ule the flows. Because the program runs in a single process, sharing data between 
flows is fast and easy. Threads are a hybrid of these approaches. Like flows based 
on processes, threads are scheduled automatically by the kernel. Like flows based 
on I/O multiplexing, threads run in the context of a single process, and thus can 
share data quickly and easily. 

Regardless of the concurrency mechanism, synchronizing concurrent accesses 
to shared data is a difficult problem. The P and V operations on semaphores have 
been developed to help deal with this problem. Semaphore operations can be used 
to provide mutually exclusive access to shared data, as well as to schedule access to 
resources such as the bounded buffers in producer-consumer systems and shared 
objects in readers-writers systems. A concurrent prethreaded echo server provides 
a compelling example of these usage scenarios for semaphores. 

Concurrency introduces other difficult issues as well. Functions that are called 
by threads must have a property known as thread safety. We have identified 
four classes of thread-unsafe functions, along with suggestions for making them 
thread-safe. Reentrant functions are the proper subset of thread-safe functions 
that do not access any shared data. Reentrant functions are often more efficient 
than non-reentrant functions, because they do not require any synchronization 
primitives. Some other difficult issues that arise in concurrent programs are races 
and deadlocks. Races occur when programmers make incorrect assumptions about 
how logical flows are scheduled. Deadlocks occur when a flow is waiting for an 
event that will never happen. 

12.16 
Write a version of hello.c (Figure 12.13) that creates and reaps n joinable peer 
threads, where n is a command-line argument. 


12.17 
A. The program in Figure 12.46 has a bug. The thread is supposed to sleep for 
   1 second and then print a string. However, when we run it on our system, 
   nothing prints. Why? 

B. You can fix this bug by replacing the exit function in line 10 with one of two 
   different Pthreads function calls. Which ones? 

code/conc/hellobug.c 
 1  /* WARNING: This code is buggy! */ 
 2  #include "csapp.h" 
 3  void *thread(void *vargp); 
 4 
 5  int main() 
 6  { 
 7      pthread_t tid; 
 8 
 9      Pthread_create(&tid, NULL, thread, NULL); 
10      exit(0); 
11  } 
12 
13  /* Thread routine */ 
14  void *thread(void *vargp) 
15  { 
16      Sleep(1); 
17      printf("Hello, world!\n"); 
18      return NULL; 
19  } 
code/conc/hellobug.c 

Figure 12.46 Buggy program for Problem 12.17. 







12.18 
Using the progress graph in Figure 12.21, classify the following trajectories as 
either safe or unsafe. 

A. H2, L2, U2, H1, L1, S2, U1, S1, T1, T2 

B. H2, H1, L1, U1, S1, L2, T1, U2, S2, T2 

C. H1, L1, H2, L2, U2, S2, U1, S1, T1, T2 

12.19 
The solution to the first readers-writers problem in Figure 12.26 gives a somewhat 
weak priority to readers because a writer leaving its critical section might restart 
a waiting writer instead of a waiting reader. Derive a solution that gives stronger 
priority to readers, where a writer leaving its critical section will always restart a 
waiting reader if one exists. 


12.20 
Consider a simpler variant of the readers-writers problem where there are at most 
N readers. Derive a solution that gives equal priority to readers and writers, in the 
sense that pending readers and writers have an equal chance of being granted 
access to the resource. Hint: You can solve this problem using a single counting 
semaphore and a single mutex. 

12.21 

Derive a solution to the second readers-writers problem, which favors writers 
instead of readers. 


12.22 
Test your understanding of the select function by modifying the server in 
Figure 12.6 so that it echoes at most one text line per iteration of the main server 
loop. 


12.23 
The event-driven concurrent echo server in Figure 12.8 is flawed because a 
malicious client can deny service to other clients by sending a partial text line. Write 
an improved version of the server that can handle these partial text lines without 
blocking. 


12.24 
The functions in the Rio I/O package (Section 10.5) are thread-safe. Are they 
reentrant as well? 


12.25 
In the prethreaded concurrent echo server in Figure 12.28, each thread calls the 
echo_cnt function (Figure 12.29). Is echo_cnt thread-safe? Is it reentrant? Why 
or why not? 

12.26 
Use the lock-and-copy technique to implement a thread-safe non-reentrant 
version of gethostbyname called gethostbyname_ts. A correct solution will use a 
deep copy of the hostent structure protected by a mutex. 


12.27 
Some network programming texts suggest the following approach for reading and 
writing sockets: Before interacting with the client, open two standard I/O streams 
on the same open connected socket descriptor, one for reading and one for writing: 

FILE *fpin, *fpout; 

fpin = fdopen(sockfd, "r"); 
fpout = fdopen(sockfd, "w"); 

When the server finishes interacting with the client, close both streams as follows: 

fclose(fpin); 
fclose(fpout); 

However, if you try this approach in a concurrent server based on threads, 
you will create a deadly race condition. Explain. 


12.28 
In Figure 12.45, does swapping the order of the two V operations have any effect 
on whether or not the program deadlocks? Justify your answer by drawing the 
progress graphs for the four possible cases: 

Case 1                Case 2                Case 3                Case 4 
Thread 1  Thread 2    Thread 1  Thread 2    Thread 1  Thread 2    Thread 1  Thread 2 
P(s)      P(s)        P(s)      P(s)        P(s)      P(s)        P(s)      P(s) 
P(t)      P(t)        P(t)      P(t)        P(t)      P(t)        P(t)      P(t) 
V(s)      V(s)        V(s)      V(t)        V(t)      V(s)        V(t)      V(t) 
V(t)      V(t)        V(t)      V(s)        V(s)      V(t)        V(s)      V(s) 

12.29 
Can the following program deadlock? Why or why not? 

Initially: a = 1, b = 1, c = 1. 

Thread 1:    Thread 2: 
P(a);        P(c); 
P(b);        P(b); 
V(b);        V(b); 
P(c);        V(c); 
V(c); 
V(a); 

12.30 
Consider the following program that deadlocks. 

Initially: a = 1, b = 1, c = 1. 

Thread 1:    Thread 2:    Thread 3: 
P(a);        P(c);        P(c); 
P(b);        P(b);        V(c); 
V(b);        V(b);        P(b); 
P(c);        V(c);        P(a); 
V(c);        P(a);        V(a); 
V(a);        V(a);        V(b); 

A. For each thread, list the pairs of mutexes that it holds simultaneously. 

B. If a < b < c, which threads violate the mutex lock ordering rule? 

C. For these threads, show a new lock ordering that guarantees freedom from 
   deadlock. 


12.31 
Implement a version of the standard I/O fgets function, called tfgets, that times 
out and returns NULL if it does not receive an input line on standard input within 
5 seconds. Your function should be implemented in a package called 
tfgets-proc.c using processes, signals, and nonlocal jumps. It should not use the Linux 
alarm function. Test your solution using the driver program in Figure 12.47. 

code/conc/tfgets-main.c 
#include "csapp.h" 

char *tfgets(char *s, int size, FILE *stream); 

int main() 
{ 
    char buf[MAXLINE]; 

    if (tfgets(buf, MAXLINE, stdin) == NULL) 
        printf("BOOM!\n"); 
    else 
        printf("%s", buf); 

    exit(0); 
} 
code/conc/tfgets-main.c 

Figure 12.47 Driver program for Problems 12.31-12.33. 

12.32 
Implement a version of the tfgets function from Problem 12.31 that uses the 
select function. Your function should be implemented in a package called 
tfgets-select.c. Test your solution using the driver program from Problem 
12.31. You may assume that standard input is assigned to descriptor 0. 

12.33 
Implement a threaded version of the tfgets function from Problem 12.31. Your 
function should be implemented in a package called tfgets-thread.c. Test your 
solution using the driver program from Problem 12.31. 

any s "Ie v aiv 
12.34 
Write a parallel threaded version of an N x M matrix multiplication kernel. 
Compare the performance to the sequential case. 


12.35 
Implement a concurrent version of the Tiny Web server based on processes. Your 
solution should create a new child process for each new connection request. Test 
your solution using a real Web browser. 


12.36 
Implement a concurrent version of the Tiny Web server based on I/O multiplexing. 
Test your solution using a real Web browser. 

12.37 
Implement a concurrent version of the Tiny Web server based on threads. Your 
solution should create a new thread for each new connection request. Test your 
solution using a real Web browser. 

12.38 
Implement a concurrent prethreaded version of the Tiny Web server. Your 
solution should dynamically increase or decrease the number of threads in response to 
the current load. One strategy is to double the number of threads when the buffer 
becomes full, and halve the number of threads when the buffer becomes empty. 
Test your solution using a real Web browser. 

12.39 
A Web proxy is a program that acts as a middleman between a Web server and 
browser. Instead of contacting the server directly to get a Web page, the browser 
contacts the proxy, which forwards the request to the server. When the server 
replies to the proxy, the proxy sends the reply to the browser. For this lab, you will 
write a simple Web proxy that filters and logs requests: 

A. In the first part of the lab, you will set up the proxy to accept requests, parse 
   the HTTP, forward the requests to the server, and return the results to the 
   browser. Your proxy should log the URLs of all requests in a log file on disk, 
   and it should also block requests to any URL contained in a filter file on 
   disk. 

B. In the second part of the lab, you will upgrade your proxy to deal with 
   multiple open connections at once by spawning a separate thread to handle 
   each request. While your proxy is waiting for a remote server to respond to 
   a request so that it can serve one browser, it should be working on a pending 
   request from another browser. 

Check your proxy solution using a real Web browser. 



Solutions to Practice Problems 


Solution to Problem 12.1 (page 975) 
When the parent forks the child, it gets a copy of the connected descriptor, and 
the reference count for the associated file table is incremented from 1 to 2. When 
the parent closes its copy of the descriptor, the reference count is decremented 
from 2 to 1. Since the kernel will not close a file until the reference counter in its 
file table goes to 0, the child's end of the connection stays open. 

Solution to Problem 12.2 (page 975) 
When a process terminates for any reason, the kernel closes all open descriptors. 
Thus, the child's copy of the connected file descriptor will be closed automatically 
when the child exits. 

Solution to Problem 12.3 (page 980) 
Recall that a descriptor is ready for reading if a request to read 1 byte from 
that descriptor would not block. If EOF becomes true on a descriptor, then the 
descriptor is ready for reading because the read operation will return immediately 
with a zero return code indicating EOF. Thus, typing Ctrl+D causes the select 
function to return with descriptor 0 in the ready set. 

Solution to Problem 12.4 (page 984) 
We reinitialize the pool.ready_set variable before every call to select because 
it serves as both an input and output argument. On input, it contains the read set. 
On output, it contains the ready set. 


Solution to Problem 12.5 (page 992) 
Since threads run in the same process, they all share the same descriptor table. No 
matter how many threads use the connected descriptor, the reference count for 
the connected descriptor's file table is equal to 1. Thus, a single close operation is 
sufficient to free the memory resources associated with the connected descriptor 
when we are through with it. 



Solution to Problem,12.6 (page 995) 

The main idea here is that stack variables are private, whereas global ‘and static 
variables are shared. Static variables such as cnt are a little tricky because the 
sharing is dimited sto the-functions within their scope—in this case, the thread 
routine. ^ 
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A. Here is the table: DUE. | 
te te teow { 1 
Variable!" iu! Referenced by il | 
instance main thread? peer thread’0?* peer thread 1? d 
ptr yes yes ' Y yes , | 
cnt no “yes yes i 
i.m yes , no no 
msgs.m yes yes yes l 
myid.pO no yes 2400 & 
myid.pi no no yes 1 
à 1 1 1 
Jd 
Notes: 
ptr      A global variable that is written by the main thread and read by the 
         peer threads. 
cnt      A static variable with only one instance in memory that is read and 
         written by the two peer threads. 
i.m      A local automatic variable stored on the stack of the main thread. 
         Even though its value is passed to the peer threads, the peer threads 
         never reference it on the stack, and thus it is not shared. 
msgs.m   A local automatic variable stored on the main thread's stack and 
         referenced indirectly through ptr by both peer threads. 
myid.p0 and myid.p1   Instances of a local automatic variable residing on 
         the stacks of peer threads 0 and 1, respectively. 


B. Variables ptr, cnt, and msgs are referenced by more than one thread and 
thus are shared. 

Solution to Problem 12.7 (page 998) 
The important idea here is that you cannot make any assumptions about the 
ordering that the kernel chooses when it schedules your threads. 


Step   Thread   Instr.   %rdx1   %rdx2   cnt 
1      1        H1       -       -       0 
2      1        L1       0       -       0 
3      2        H2       -       -       0 
4      2        L2       -       0       0 
5      2        U2       -       1       0 
6      2        S2       -       1       1 
7      1        U1       1       -       1 
8      1        S1       1       -       1 
9      1        T1       1       -       1 
10     2        T2       -       1       1 


Variable cnt has a final incorrect value of 1. 

Solution to Problem 12.8 (page 1001) 

This problem is a simple test of your understanding of safe and unsafe trajectories 
in progress graphs. Trajectories such as A and C that skirt the critical region are 
safe and will produce correct results. 


A. H1, L1, U1, S1, H2, L2, U2, S2, T2, T1: safe 
B. H2, L2, H1, L1, U1, S1, T1, U2, S2, T2: unsafe 
C. H1, H2, L2, U2, S2, L1, U1, S1, T1, T2: safe 


Solution to Problem 12.9 (page 1006) 


A. p = 1, c = 1, n > 1: Yes, the mutex semaphore is necessary because the 
producer and consumer can concurrently access the buffer. 


B. p = 1, c = 1, n = 1: No, the mutex semaphore is not necessary in this case, 
because a nonempty buffer is equivalent to a full buffer. When the buffer 
contains an item, the producer is blocked. When the buffer is empty, the 
consumer is blocked. So at any point in time, only a single thread can access 
the buffer, and thus mutual exclusion is guaranteed without using the mutex. 


C. p > 1, c > 1, n = 1: No, the mutex semaphore is not necessary in this case 
either, by the same argument as the previous case. 


Solution to Problem 12.10 (page 1008) 
Suppose that a particular semaphore implementation uses a LIFO stack of threads 
for each semaphore. When a thread blocks on a semaphore in a P operation, its ID 
is pushed onto the stack. Similarly, the V operation pops the top thread ID from 
the stack and restarts that thread. Given this stack implementation, an adversarial 
writer in its critical section could simply wait until another writer blocks on the 
semaphore before releasing the semaphore. In this scenario, a waiting reader 
might wait forever as two writers passed control back and forth. 

Notice that although it might seem more intuitive to use a FIFO queue rather 
than a LIFO stack, using such a stack is not incorrect and does not violate the 
semantics of the P and V operations. 


Solution to Problem 12.11 (page 1020) 
This problem is a simple sanity check of your understanding of speedup and 
parallel efficiency: 


Threads (t)           1       2       4 
Cores (p)             1       2       4 
Running time (T_p)    12      8       6 
Speedup (S_p)         1       1.5     2 
Efficiency (E_p)      100%    75%     50% 
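
The rows of the table follow from the usual definitions, S_p = T_1 / T_p and E_p = S_p / p. A small sketch that recomputes them, with function names chosen only for this illustration:

```c
/* Speedup on p cores: S_p = T_1 / T_p */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency on p cores: E_p = S_p / p, as a fraction between 0 and 1 */
double efficiency(double t1, double tp, int p)
{
    return speedup(t1, tp) / p;
}
```

With T_1 = 12, calling speedup(12, 8) and efficiency(12, 8, 2) gives the two-core column, and speedup(12, 6) and efficiency(12, 6, 4) gives the four-core column.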


Solution to Problem 12.12 (page 1024) 
The ctime_ts function is not reentrant, because each invocation shares the same 
static variable returned by the ctime function. However, it is thread-safe because 
the accesses to the shared variable are protected by P and V operations, 
and thus are mutually exclusive. 


Solution to Problem 12.13 (page 1026) 

If we free the block immediately after the call to pthread_create in line 14, then 
we will introduce a new race, this time between the call to free in the main thread 
and the assignment statement in line 24 of the thread routine. 


Solution to Problem 12.14 (page 1026) 


A. Another approach is to pass the integer i directly, rather than passing a 
pointer to i: 

    for (i = 0; i < N; i++) 
        Pthread_create(&tid[i], NULL, thread, (void *)i); 

In the thread routine, we cast the argument back to an int and assign it to 
myid: 

    int myid = (int) vargp; 

B. The advantage is that it reduces overhead by eliminating the calls to malloc 
and free. A significant disadvantage is that it assumes that pointers are at 
least as large as ints. While this assumption is true for all modern systems, 
it might not be true for legacy or future systems. 
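
One way to make the size assumption explicit is to route the conversion through intptr_t, an integer type from <stdint.h> that, where the implementation provides it, can hold any void pointer value. The following is only a sketch of that idea, not the text's solution, and the helper names int_to_arg and arg_to_int are invented for the illustration:

```c
#include <stdint.h>

/* Pack a small integer into the void * argument passed to a thread routine */
void *int_to_arg(int i)
{
    return (void *)(intptr_t)i;
}

/* Recover the integer inside the thread routine */
int arg_to_int(void *vargp)
{
    return (int)(intptr_t)vargp;
}
```

The main thread would pass int_to_arg(i) as the last argument when creating the thread, and the thread routine would assign arg_to_int(vargp) to myid. On a system where pointers were narrower than ints, the round trip could still lose bits, so the caveat in part B continues to apply.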


Solution to Problem 12.15 (page 1029) 

A. The progress graph for the original program is shown in Figure 12.48. 
B. The program always deadlocks, since any feasible trajectory is eventually 
trapped in a deadlock state. 
C. To eliminate the deadlock potential, initialize the binary semaphore t to 1 
instead of 0. 
D. The progress graph for the corrected program is shown in Figure 12.49. 

Figure 12.48 Progress graph for a program that deadlocks. 

Figure 12.49 Progress graph for the corrected deadlock-free program. 








Appendix A Error Handling 


Programmers should always check the error codes returned by system-level 
functions. There are many subtle ways that things can go wrong, and it only makes 
sense to use the status information that the kernel is able to provide us. 
Unfortunately, programmers are often reluctant to do error checking because it 
clutters their code, turning a single line of code into a multi-line conditional 
statement. Error checking is also confusing because different functions indicate 
errors in different ways. 

We were faced with a similar problem when writing this text. On the one hand, 
we would like our code examples to be concise and simple to read. On the other 
hand, we do not want to give students the wrong impression that it is OK to skip 
error checking. To resolve these issues, we have adopted an approach based on 
error-handling wrappers that was pioneered by W. Richard Stevens in his network 
programming text [110]. 

The idea is that given some base system-level function foo, we define a 
wrapper function Foo with identical arguments, but with the first letter capitalized. 
The wrapper calls the base function and checks for errors. If it detects an error, the 
wrapper prints an informative message and terminates the process. Otherwise, it 
returns to the caller. Notice that if there are no errors, the wrapper behaves exactly 
like the base function. Put another way, if a program runs correctly with wrappers, 
it will run correctly if we render the first letter of each wrapper in lowercase and 
recompile. 

The wrappers are packaged in a single source file (csapp.c) that is compiled 
and linked into each program. A separate header file (csapp.h) contains the 
function prototypes for the wrappers. 

This appendix gives a tutorial on the different kinds of error handling in Unix 
systems and gives examples of the different styles of error-handling wrappers. 
Copies of the csapp.h and csapp.c files are available at the CS:APP Web site. 




A.1 Error Handling in Unix Systems 


The systems-level function calls that we will encounter in this book use three 
different styles for returning errors: Unix-style, Posix-style, and GAI-style. 


Unix-Style Error Handling 


Functions such as fork and wait that were developed in the early days of Unix (as 
well as some older Posix functions) overload the function return value with both 
error codes and useful results. For example, when the Unix-style wait function 
encounters an error (e.g., there is no child process to reap), it returns -1 and sets 
the global variable errno to an error code that indicates the cause of the error. If 
wait completes successfully, then it returns the useful result, which is the PID of 
the reaped child. Unix-style error-handling code is typically of the following form: 


    if ((pid = wait(NULL)) < 0) { 
        fprintf(stderr, "wait error: %s\n", strerror(errno)); 
        exit(0); 
    } 

The strerror function returns a text description for a particular value of 
errno. 
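
As a runnable sketch of the Unix style, the fragment below calls wait in a process that is assumed to have no children, so the call fails and errno is set to ECHILD. The helper name try_wait is hypothetical, and unlike the form above it returns to the caller instead of exiting so that the error code can be inspected.

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: try to reap one child.
 * Returns the child's PID, or -1 with errno describing the failure.
 */
pid_t try_wait(void)
{
    pid_t pid = wait(NULL);

    if (pid < 0) {
        int saved = errno;   /* fprintf may change errno, so save it first */
        fprintf(stderr, "wait error: %s\n", strerror(saved));
        errno = saved;       /* restore it for the caller */
    }
    return pid;
}
```

The single return value carries either the useful result (a PID) or the error indication (-1), which is exactly the overloading that characterizes this style.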


Posix-Style Error Handling 
Many of the newer Posix functions such as Pthreads use the return value only 
to indicate success (zero) or failure (nonzero). Any useful results are returned 
in function arguments that are passed by reference. We refer to this approach as 
Posix-style error handling. For example, the Posix-style pthread_create function 
indicates success or failure with its return value and returns the ID of the newly 
created thread (the useful result) by reference in its first argument. Posix-style 
error-handling code is typically of the following form: 

    if ((retcode = pthread_create(&tid, NULL, thread, NULL)) != 0) { 
        fprintf(stderr, "pthread_create error: %s\n", strerror(retcode)); 
        exit(0); 
    } 

The strerror function returns a text description for a particular value of 
retcode. 
GAI-Style Error Handling 


The getaddrinfo (GAI) and getnameinfo functions return zero on success and 
a nonzero value on failure. GAI error-handling code is typically of the 
following form: 

    if ((retcode = getaddrinfo(host, service, &hints, &result)) != 0) { 
        fprintf(stderr, "getaddrinfo error: %s\n", gai_strerror(retcode)); 
        exit(0); 
    } 

The gai_strerror function returns a text description for a particular value 
of retcode. 
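
A self-contained sketch of the GAI style follows. To keep it independent of any name server, it sets the AI_NUMERICHOST flag so that only numeric address strings are accepted. The helper name try_numeric_lookup and the choice of hints are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200112L
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: convert a numeric host string and service
 * into a socket address list. Returns 0 or a GAI error code.
 */
int try_numeric_lookup(const char *host, const char *service)
{
    struct addrinfo hints, *result = NULL;
    int retcode;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;          /* IPv4 only */
    hints.ai_socktype = SOCK_STREAM;    /* connections only */
    hints.ai_flags = AI_NUMERICHOST;    /* no name lookup */

    if ((retcode = getaddrinfo(host, service, &hints, &result)) != 0) {
        /* GAI codes have their own message function, not strerror */
        fprintf(stderr, "getaddrinfo error: %s\n", gai_strerror(retcode));
        return retcode;
    }
    freeaddrinfo(result);               /* release the list on success */
    return 0;
}
```

A call such as try_numeric_lookup("127.0.0.1", "80") is expected to succeed, whereas a host string that is not a numeric address yields a nonzero GAI code whose text comes from gai_strerror.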


Summary of Error-Reporting Functions 


Throughout this book, we use the following error-reporting functions to 
accommodate different error-handling styles. 


#include "csapp.h" 

void unix_error(char *msg); 
void posix_error(int code, char *msg); 
void gai_error(int code, char *msg); 
void app_error(char *msg); 

Returns: nothing 


As their names suggest, the unix_error, posix_error, and gai_error functions 
report Unix-style, Posix-style, and GAI-style errors and then terminate. The 
app_error function is included as a convenience for application errors. It simply 
prints its input and then terminates. Figure A.1 shows the code for the 
error-reporting functions. 





A.2 Error-Handling Wrappers 
Here are some examples of the different error-handling wrappers: 


Unix-style error-handling wrappers. Figure A.2 shows the wrapper for the 
Unix-style wait function. If the wait returns with an error, the wrapper prints 
an informative message and then exits. Otherwise, it returns a PID to the 
caller. Figure A.3 shows the wrapper for the Unix-style kill function. 
Notice that this function, unlike wait, returns void on success. 


Posix-style error-handling wrappers. Figure A.4 shows the wrapper for the 
Posix-style pthread_detach function. Like most Posix-style functions, it 
does not overload useful results with error-return codes, so the wrapper 
returns void on success. 


GAI-style error-handling wrappers. Figure A.5 shows the error-handling 
wrapper for the GAI-style getaddrinfo function. 


code/src/csapp.c 
void unix_error(char *msg) /* Unix-style error */ 
{ 
    fprintf(stderr, "%s: %s\n", msg, strerror(errno)); 
    exit(0); 
} 

void posix_error(int code, char *msg) /* Posix-style error */ 
{ 
    fprintf(stderr, "%s: %s\n", msg, strerror(code)); 
    exit(0); 
} 

void gai_error(int code, char *msg) /* Getaddrinfo-style error */ 
{ 
    fprintf(stderr, "%s: %s\n", msg, gai_strerror(code)); 
    exit(0); 
} 

void app_error(char *msg) /* Application error */ 
{ 
    fprintf(stderr, "%s\n", msg); 
    exit(0); 
} 
code/src/csapp.c 


Figure A.1 Error-reporting functions. 


code/src/csapp.c 
pid_t Wait(int *status) 
{ 
    pid_t pid; 

    if ((pid = wait(status)) < 0) 
        unix_error("Wait error"); 
    return pid; 
} 
code/src/csapp.c 


Figure A.2 Wrapper for Unix-style wait function. 

code/src/csapp.c 
void Kill(pid_t pid, int signum) 
{ 
    int rc; 

    if ((rc = kill(pid, signum)) < 0) 
        unix_error("Kill error"); 
} 
code/src/csapp.c 

Figure A.3 Wrapper for Unix-style kill function. 

code/src/csapp.c 
void Pthread_detach(pthread_t tid) { 
    int rc; 

    if ((rc = pthread_detach(tid)) != 0) 
        posix_error(rc, "Pthread_detach error"); 
} 
code/src/csapp.c 


Figure A.4 Wrapper for Posix-style pthread_detach function. 





code/src/csapp.c 
void Getaddrinfo(const char *node, const char *service, 
                 const struct addrinfo *hints, struct addrinfo **res) 
{ 
    int rc; 

    if ((rc = getaddrinfo(node, service, hints, res)) != 0) 
        gai_error(rc, "Getaddrinfo error"); 
} 
code/src/csapp.c 


Figure A.5 Wrapper for GAI-style getaddrinfo function. 

Page numbers of defining references are italicized. Entries that belong to a 
hardware or software system are followed by a tag in brackets that identifies the 
system, along with a brief description to jog your memory. Here is the list of tags 
and their meanings. 

[C] C language construct 
[C Stdlib] C standard library function 
[CS:APP] Program or function developed in this text 
[HCL] HCL language construct 
[Unix] Unix program, function, variable, or constant 
[x86-64] x86-64 machine-language instruction 
[Y86-64] Y86-64 machine-language instruction 




! [HCL] NOT operation, 373 
$ for immediate operands, 181 
& [C] address of operation 
local variables, 248 
logic gates, 373 
pointers, 48, 188, 257, 277 
* [C] dereference pointer operation, 
188 
-> [C] dereference and select field 
operation, 266 


. (periods) in dotted-decimal notation, 


|| [HCL] OR operation, 373 

< operator for left hoinkies, 909 

<< “put to” operator (C++), 890 

> operator for right hoinkies, 909 

>> "get from" operator (C++), 890 
+^t_w (two's-complement addition), 60, 


*^t_w (two's-complement multiplication), 
60, 97 
-^t_w (two's-complement negation), 60, 


+^u_w (unsigned addition), 60, 85, 89 
*^u_w (unsigned multiplication), 60, 96 
-^u_w (unsigned negation), 60, 89 


8086 microprocessor, 167 


8087 floating-point coprocessor, 109, 
137, 167 
80286 microprocessor, 167 


.a archive files, 686 
a. out object file, 673 
Abel, Niels Henrik, 89 
abelian group, 89 
ABI (application binary, interface), 
310 
abort exception class, 726 
aborts, 728 
absolute addressing relocation type, 
691, 693-694 
absolute pathnames, 893 
absolute speedup of parallel programs, 
1019 
abstract operation model for Core i7, 
525-531 
abstractions, 27 
accept [Unix] wait for client 
connection request, 933, 936, 
936-937 
access 
disks, 597-600 
IA32 registers, 179-180 
main memory, 587-589 
x86-64 registers 
data movement, 182-189 
operand specifiers, 180-182 
access permission bits, 894 
access time for disks, 593, 593-595 
accumulator variable expansion, 570 
accumulators, multiple, 536—541 
Acorn RISC machine (ARM) 
ISAs, 352 
processor architecture, 363 
actions, signal, 762 
active sockets, 935 
actuator arms, 592 
acyclic networks, 374 
adapters, 9, 597 
ADD [instruction class] add, 192 
add_client function, 981, 983 
add every signal to signal set 
instruction, 765 
add instruction, 192 
ADD operation in execute stage, 408 
add signal to signal set instruction, 765 
adder [CS:APP] CGI adder, 955 
addition 
floating point, 122-124, 302 
two’s complement, 90, ios 
unsigned, 84-90, 85 
Y86-64, 356 
additive inverse, 52 
addq [Y86-64] add, 356, 402 
address exceptions, status code for, 
404 
address of operator (&) [C] 
local variables, 248 
logic gates, 373 
pointers, 48, 188, 257, 277 
address order of free lists, 863 
address partitioning in caches, 615, 
615-616 
address-space layout randomization 
(ASLR), 285, 285-286 
address spaces, 804 
child processes, 741 
linear, 804 
private, 734 
virtual, 804-805 
address translation, 804 
caches and VM integration, 817 
Core i7, 826-828 
end-to-end, 821-825 
multi-level page tables, 819-821 
optimizing, 830 
overview, 813-816 
TLBs for, 817-819 
addresses and addressing 
byte ordering, 42-49 
effective, 690 
flat, 167 
internet, 922 
invalid address status code, 364 
I/O devices, 598 
IP, 924, 925-927 
machine-level programming, 170- 
171 
operands, 181 
out of bounds. See buffer overflow 
physical vs. virtual, 803-804 
pointers, 257, 277 
procedure return, 240 
segmented, 287-288 
sockets, 930, 933-934 
structures, 265-267 
symbol relocation, 690-691 
virtual, 804 
virtual memory, 34 
Y86-64, 356, 359 
addressing modes, 181 
adjacency matrices, 660 
ADR [Y86-64] status code indicating 
invalid address, 364 
Advanced Micro Devices (AMD), 
165, 168 
Intel compatibility, 168 
x86-64. See x86-64 microprocessors 


Advanced Research Projects 
Administration (ARPA), 931 
advanced vector extensions (AVX) 
instructions, 294, 546-547 
AFS (Andrew File System), 610 
aggregate data types, 171 
aggregate payloads, 845 
%al [x86-64] low order 8 of register 
%rax, 180 
alarm [Unix] schedule alarm to self, 
762, 763 
algebra, Boolean, 50-53, 52 
aliasing memory, 499, 500 
align directive, 366 
alignment 
data, 273, 273-276 
memory blocks, 844 
alloca [Unix] stack storage allocation 
function, 285, 290, 324 
allocate and initialize bounded buffer 
function, 1007 
allocate heap block function, 860, 
861 
allocate heap storage function, 840 
allocated bit, 848 
allocated blocks 
vs. free, 839 
placement, 849 
allocation 
blocks, 860 
dynamic memory. See dynamic 
memory allocation 
pages, 810 
allocators 
block allocation, 860 
block freeing and coalescing, 860 
free list creation, 857-859 
free list manipulation, 856-857 
general design, 854-856 
practice problems, 861-862 
requirements and goals, 844-845 
styles, 839-840 
Alpha (Compaq Computer Corp.) 
RISC processors, 363 
alternate representations of signed 
integers, 68 
ALUADD [Y86-64] function code for 
addq instruction, 404 
ALUs (arithmetic/logic units), 10 
combinational circuits, 380 
in execute stage, 385 
sequential Y86-64 implementation, 
408-409 
always taken branch prediction 
strategy, 428 


AMD (Advanced Micro Devices), 
165, 168 
Intel compatibility, 168 
microprocessor data alignment, 276 
x86-64. See x86-64 microprocessors 
Amdahl, Gene, 22 
Amdahl’s law, 22, 22-24, 562, 568 
American National Standards 
Institute (ANSI), 4, 35 
ampersands (&) address operator, 248 
local addresses, 248 
logic gates, 373 
pointers, 48, 188, 257, 277 
AND [instruction class] and, 192 
and instruction, 192 
AND operations 
Boolean, 51-52 
execute stage, 408 
HCL expressions, 374-375 
logic gates, 373 
logical, 56-57 
AND packed double precision 
instruction, 305 
AND packed single precision 
instruction, 305 
andq [Y86-64] and, 356 
Andreesen, Marc, 949 
Andrew File System (AFS), 610 
anonymous files, 833 
ANSI (American National Standards 
Institute), 4, 35 
AOK [Y86-64] status code for normal 
operation, 363 
app_error [CS:APP] reports 
application errors, 1043 
application binary interface (ABI), 
310 
applications, loading and linking 
shared libraries from, 701-703 
AR Linux archiver, 636, 713 
arbitrary size arithmetic, 85 
Archimedes, 140 
architecture 
floating-point, 293, 293-296 
Y86. See Y86-64 instruction set 
architecture 
archives, 686 
areal density of disks, 592 
areas 
shared, 834 
swap, 
virtual memory, 830 
arguments 
execve function, 750 
Web servers, 953-954 


arithmetic, 33, 191 
discussion, 196-197 
floating-point code, 302-304 
integer. See integer arithmetic 
latency and issue time, 523 
load effective address, 191—193 
pointers, 257—258, 873 
saturating, 134 
shift operations, 58, 104-106, 192, 
194-196 
special, 197-200 
unary and binary, 194—196 


arithmetic/logic units (ALUs), 10 
combinational circuits, 380 
in execute stage, 385 
sequential Y86-64 implementation, 
408-409 
ARM (Acorn RISC machine), 43 
ISAs, 352 
processor architecture, 363 
ARM A7 microprocessor, 353 
arms, actuator, 592 
ARPA (Advanced Research Projects 
Administration), 931 
ARPANET, 931 
arrays, 255 
basic principles, 255-257 
declarations, 255-256, 263 
DRAM, 582 
fixed-size, 260-262 
machine-code representation, 171 
nested, 258-260 
pointer arithmetic, 257-258 
pointer relationships, 48, A 
stride, 606 
variable-size, 262-265 
ASCII standard, 3 
character codes, 49 
limitations, 50 
asctime function, 1024 
ASLR (address-space layout 
randomization), 285, 285-286 
asm directive, 178 
assembler directives, 366 
assemblers, 5, 5, 164, 170 
assembly code, 5, 164 
with C programs, 289-290 
formatting, 175-177 
Y86-64, 359 
assembly phase, 5 
associate socket address with 
descriptor function, 935, 935 
associative caches, 624—626 
associative memory, 625 
associativity 
caches, 633 


floating-point addition, 123-124 
asterisks (*) dereference pointer 
operation, 188, 257, 277 
asymmetric ranges in two's- 
complement representation, 
66, 77 
async-signal-safe function, 766 
async-signal safety, 766 
asynchronous interrupts, 726 
atomic reads and writes, 770 
ATT assembly code format, 177, 294, 
311 
argument listing, 306 
condition codes, 201—202 
cqo instruction, 199 
vs. Intel, 177 
operands, 181, 192 
Y86-64, 356 
automatic variables, 994 
AVX (advanced vector extensions) 
instructions, 276, 294, 546-547 
%ax [x86-64] low order 16 bits of 
register %rax, 180 


B2T (binary to two's-complement 
conversion), 60, 64, 72, 97 

B2U (binary to unsigned conversion), 
60, 62, 72, 82, 97 

background processes, 753, 753-756 

backlogs for listening sockets, 935 

backups for disks, 611 

backward compatibility, 35 

backward taken, forward not taken 
(BTFNT) branch prediction 
strategy, 428 

bad pointers and virtual memory, 
870-871 

badcnt.c [CS:APP] improperly 

synchronized program, 995-999, 

996 

bandwidth, read, 639 

Barracuda 7400 drives, 600 

base pointers, 290 

base registers, 181 

bash [Unix] Unix shell program, T 

basic blocks, 569 

Bell Laboratories, 35 

Berkeley sockets, 932 

Berners-Lee, Tim, 949 

best-fit block placement policy, 849, 
849 

bi-endian ordering convention, 43 

biased number encoding, 113, 113-117 

biasing in division, 106 
big-endian ordering convention, 42, 
42-44 
bigrams statistics, 565 
bijections, 64, 64 
/bin/kill program, 760 
binary files, 3, 897 
binary notation, 32 
binary points, 110, 110-111 
binary representations 
conversions 
with hexadecimal, 36-37 
signed and unsigned, 70-76 
to two’s complement, 64, 72-73, 
97 
to unsigned, 62-63 
fractional, 109-112 
machine language, 194 
binary semaphores, 1003 
binary tree structure, 270-271 
bind [Unix] associate socket address 
with descriptor, 933, 935, 935 
binding, lazy, 706 
binutils package, 713 


bistable memory cells, 581 
bit-level operations, 54-56 
bit representation expansion, 76-80 
bit vectors, 51, 51-52 
bits, 3 
overview, 32 
union access to, 271-272 
bitwise operations, 305-306 
%bl [x86-64] low order 8 of register 
%rbx, 180 
block and unblock signals instruction, 
765 
block devices, 892 
block offset bits, 616 
block pointers, 856 
block size 
caches, 633 
minimum, 848 
blocked bit vectors, 759 
blocked signals, 758, 759, 764-765 
blocking 
signals, 764-765 
for temporal locality, 647 
blocks 
aligning, 844 
allocated, 839, 849 
vs. cache lines, 634 
caches, 611, 611-612, 615, 633 
coalescing, 850-851, 860 
epilogue, 855 
free lists, 847-849 
freeing, 860 
heap, 839 
logical disk, 595, 595-596, 601 
prologue, 855 
referencing data in, 874-875 
splitting, 849-850 

bodies, response, 952 

bool [HCL] bit-level signal, 374 

Boole, George, 50 

Boolean algebra and functions, 50 
HCL, 374—375 
logic gates, 373 
properties, 52 
working with, 50—53 

Boolean rings, 52 

bottlenecks, 562 
profilers, 565-568 
program profiling, 562-564 

bottom of stack, 190 

boundary tags, 851, 851-854, 859 

bounded buffers, 1004, 1005-1006 

bounds 
latency, 518, 524 
throughput, 518, 524- 

Abp [x86-64] low order 16 bits of 

register žrbp; 180 
%bp1 [x86-64] low order 8 of register 
Vrbp, 180 

branch prediction, 5/9, 519 
misprediction handling, 443—444. 
performance, 5494553 
Y86:64 pipelining, 428 

branch prediction logic, 275. 

branches, conditional, 172, 209 
assembly form, 211 
condition codes, 201-202 
condition control, 209-213 
moves, 214-220, 550—553 
switch, 232-238 

break command 
in GDB, 280 
with switch, 233 

break multstore command in GDB, 
280 
breakpoints, 279-280 
bridged Ethernet, 920, 921 
bridges 
Ethernet, 920 
I/O, 587 
browsers, 948, 949 
.bss section, 674 
BTFNT (backward taken, forward 
not taken) branch prediction 
strategy, 428 

bubbles, pipeline, 434, 434—435, 
459-460 

buddies, 865 

buddy systems, 865, 865 





buffer overflow, 279 
execution code regions limits for, 
289-290 
memory-related bugs, 871 
overview, 279-284 
stack corruption detection for, 
286-289 
stack randomization for, 284-286 
vulnerabilities, 7 
buffered I/O functions, 898-902 
buffers 
bounded, 1004, 1005-1006 
read, 898, 900-901 
store, 557-558 
streams, 911 
bus transactions, 587 
buses, 8, 587 
designs, 588, 598 
I/O, 596 
memory, 587 
bypassing for data hazards, 436-439 
byte data connections in hardware 
diagrams, 398 
byte order, 42-49 
disassembled code, 209 
network, 925 
unions, 272 
bytes, 3, 34 
copying, 133 
range, 36 
register operations, 181 
Y86 encoding, 359-360 
%bx [x86-64] low order 16 bits of 
register %rbx, 180 


C language 
bit-level operations, 54-56 
floating-point representation, 

124-126 
history, 35 
logical operations, 56-57 
origins, 4 
shift operations, 57—59. 
static libraries, 684—688 

C++ language; 677. 
linker symbols, 680 
objects, 266—267 T 
software exceptions, 723-724,.786 

. c source files, 671 

C standard library, 455, 6 

C11 standard, 35 

C90 standard, 35 

C99 standard, 35 
fixed data sizes, 41 
integral data types, 67 

cache block offset (CO), 823 





cache blocks, 615 
cache-friendly code, 633—639, 634 
cache lines 
cache sets, 615 
vs. sets and blocks, 634 
cache-oblivious algorithms, 649 
cache set index (CI), 823 
cache tags (CT), 823 
cached pages, 806 
caches and cache memory, 610, 615 
address translation, 823 
anatomy, 631 
associativity, 633 
cache-friendly code, 633-639, 634 
data, 520, 631, 631 
direct-mapped. See direct-mapped 
caches 
DRAM, 806 
fully associative, 627-628 
hits, 612 
importance, 11-14 
instruction, 518, 631, 631 
locality in, 605, 643-647, 810 
managing, 613 
memory mountains, 639-643 
misses, 470, 612, 612-613 
organization, 615-617 
overview, 610—612 
page allocation, 810 
page faults, 808, 808-809 
page hits, 808 
page tables, 806-808, 807 
performance, 533, 631-633, 639-647 
practice problems, 628—630 
proxy, 952 
purpose, 580 
set associative, 624, 624-626 
size, 632 
SRAM, 806 
symbols, 617 
virtual memory with, 805-811, 817 
write issues, 630-631 
write strategies, 633 
Y86-64 pipelining, 469-470 
call [x86-64] procedure call, 241-242, 


357 
call [Y86-64] instruction, 404, 428 
callee procedures, 251 

callee-save registers, 251, 251-252 

caller procedures, 251 

caller-save registers, 251, 251-252 

calling environments, 783 

calloc function [C Stdlib] memory 
allocation 

declaration, 134 

dynamic memory allocation, 841 


security vulnerability, 100-101 
callq [x86-64] procedure call, 241 
calls, 17, 727-728 

error handling, 737-738 

Linux/x86-64 systems, 730-731 

in performance, 512-513 
canary values, 286—287 
canceling mispredicted branch 

handling, 444 
capacity 

caches, 615 

disks, 591, 591-592 

functional units, 523 
capacity misses, 613 
cards, graphics, 597 
carriage return (CR) characters, 892 
carry flag condition code, 201, 306 
CAS (column access strobe) requests, 
583 
case expressions in HCL, 378, 378 
casting, 44 

explicit, 75 

floating-point values, 125 

pointers, 278, 854 

signed values, 70-71 
catching signals, 758, 761, 763 
cells 
DRAM, 582, 583 
SRAM, 581 
central processing units (CPUs), 9, 
9-10 
Core i7. See Core i7 microprocessors 


early instruction sets, 361 
effective cycle time, 602 
embedded, 363 
Intel. See Intel microprocessors 
logic design. See logic design 
many-core, 471 
multi-core, 16, 24-25, 168, 605, 972 
overview, 352-354 
pipelining. See pipelining 
RAM, 384 
sequential Y86 implementation. 
See sequential Y86-64 
implementation 
superscalar, 26, 471, 518 
trends, 602-603 
Y86. See Y86-64 instruction set 
architecture 
Cerf, Vinton, 931 
CERT (Computer Emergency 
Response Team), 100 
CF [x86-64] carry flag condition code, 
201, 306 





CGI (common gateway interface) 
program, 953, 953-955 
CGI adder function, 955 
chains, proxy, 952 
char [C] data types, 40, 61 
character codes, 49 
character devices, 892 
check_clients function, 981, 984 
child processes, 740 
creating, 741-743 
default behavior, 744 
error conditions, 745-746 
exit status, 745 
reaping, 743, 743-749 
waitpid function, 746-749 
CI (cache set index), 823 
circuits 
combinational, 374, 374—380 
retiming, 421 
sequential, 387 
CISC (complex instruction set 
computers), 361, 361-363 
%cl [x86-64] low order 8 of register 
%rcx, 180 
Clarke, Dave, 931 
classes 
data hazards, 435 
exceptions, 726—728 
instructions, 182 
size, 863 
storage, 994-995 
clear bit in descriptor set macro, 978 
clear descriptor set macro, 978 
clear signal set instruction, 765 
client-server model, 918, 918-919 
clienterror [CS:APP] Tiny helper 
function, 959-960 
clients 
client-server model, 918 
telnet, 27 
clock signals, 381 
clocked registers, 401-402 
clocking in logic design, 381-384 
close [Unix] close file, 894, 894-895 
close operations for files, 894, 894-895 
close shared library function, 702 
closedir functions, 905 
cltq [x86-64] sign extend %eax to 
%rax, 185 

cmova [x86-64] move if unsigned 
greater, 217 

cmovae [x86-64] move if unsigned 
greater or equal, 217 

cmovb [x86-64] move if unsigned less, 
217 

cmovbe [x86-64] move if unsigned less 
or equal, 217 

cmove [Y86-64] move when equal, 357 

cmovg [x86-64] move if greater, 217, 
357 

cmovge [x86-64] move if greater or 
equal, 217, 357 

cmovl [x86-64] move if less, 217, 357 

cmovle [x86-64] move if less or equal, 
217, 357 

cmovna [x86-64] move if not unsigned 
greater, 217 

cmovnae [x86-64] move if unsigned 
greater or equal, 217 

cmovnb [x86-64] move if not unsigned 
less, 217 

cmovnbe [x86-64] move if not unsigned 
less or equal, 217 

cmovne [x86-64] move if not equal, 
217, 357 

cmovng [x86-64] move if not greater, 
217 

cmovnge [x86-64] move if not greater 
or equal, 217 

cmovnl [x86-64] move if not less, 217 

cmovnle [x86-64] move if not less or 
equal, 217 

cmovns [x86-64] move if nonnegative, 
217 

cmovnz [x86-64] move if not zero, 217 

cmovp [x86-64] move if even parity, 
324 

cmovs [x86-64] move if negative, 217 

cmovz [x86-64] move if zero, 217 


CMP [instruction class] compare, 202 
cmpb [x86-64] compare byte, 202 
cmpl [x86-64] compare double word, 
202 
cmpq [x86-64] compare quad word, 
202 
cmpw [x86-64] compare word, 202 
cmtest script, 465 
CO (cache block offset), 823 
coalescing blocks, 860 
with boundary tags, 851-854 
free, 850 
memory, 847 
Cocke, John, 361 
code 
performance strategies, 561—562 
profilers, 562-564 
representing, 49—50 
self-modifying, 435 B 
Y86 instructions, 358, 359-360: 
code motion, 508 
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code segments, 696, 697-698 
Cohen, Danny, 43 
cold caches, 612
cold misses, 612 
Cold War, 931 
collectors, garbage, 839, 866
basics, 866-867 
conservative, 867, 869-870 
Mark&Sweep, 867-870 
column access strobe (CAS) requests, 
583 
column-major sum function, 636 
combinational circuits, 374, 374—380 
combinational pipelines, 412-414, 
460-462 
common gateway interface (CGI) 
program, 953, 953-955
Compaq Computer Corp. RISC
processors, 363
compare byte instruction, 202 
compare double precision, 306 
compare double word instruction, 202 
compare instructions, 202 
compare single precision, 306 
compare word instruction, 202
comparison operations for floating- 
point code, 306-309 
compilation phase, 5
compilation systems, 6, 6-7 
compile time, 670 
compile-time interpositioning, 708— 
709 
compiler drivers, 4, 671-672
compilers, 6, 164
optimizing capabilities and
limitations, 498-502
process, 169-170 
purpose, 171 
complement instruction, 192 
complex instruction set computers 
(CISC), 361, 361-363 
compulsory misses, 612
computation stages in pipelining,
421-422
computed goto, 233 
Computer Emergency Response 
Team (CERT), 100 
computer systems, 2 
concurrency, 972 
ECF for, 723 
flow synchronizing, 776—778 
and parallelism, 24 
run, 733 
thread-level, 24-26
concurrent execution, 733
concurrent flow, 733, 733-734 
concurrent processes, 15, 16 
concurrent programming, 972-973 
deadlocks, 1027-1030 
with I/O multiplexing, 978-985 
library functions in, 1024-1025 
with processes, 973-977
races, 1025-1027 
reentrancy issues, 1023-1024 
shared variables, 992-995 
summary, 1030 
threads, 985-992 
for parallelism, 1013-1018 
safety issues, 1020-1022 
concurrent programs, 972
concurrent servers, 972
based on prethreading, 1005-1013
based on processes, 974-975
based on threads, 991—992 
condition code registers, 171 
hazards, 435 
SEQ timing, 401—402 
condition codes, 201, 201—202 
accessing, 202-205 
x86-64, 201 
Y86-64, 355-357 
condition variables, 1010 
conditional branches, 172, 209 
assembly form, 211 
condition codes, 201—202 
condition control, 209-213 
moves, 214—220, 550-553 
switch, 232—238 
conflict misses, 613, 622-624
connect [Unix] establish connection
with server, 934, 934-935
connected descriptors, 936, 936-937
connections 
EOF on, 948 
Internet, 925, 929—931 
I/O devices, 596-597
persistent, 952 
conservative garbage collectors, 867, 
869—870 
constant words in Y86-64, 359 
constants 
floating-point code, 304—305 
free lists, 856-857 
maximum and minimum values, 68 
multiplication, 101—103 
for ranges, 67-68 
Unix, 746 
content 
dynamic, 953-954 
serving, 949
Web, 948, 949-950
context switches, 16, 736-737
contexts, 736
processes, 16, 732
thread, 986, 993
continue command, 280 
Control Data Corporation 6600 
processor, 522 
control dependencies in pipelining, 
419, 429 
control flow, 722 
exceptional. See exceptional control 
flow (ECF) 
logical, 732, 732-733 
machine-language procedures, 239 
control hazards, 429 
control logic blocks, 398, 398, 405, 426 
control logic in pipelining, 455 
control mechanism combinations, 
460—462 
control mechanisms, 459-460 
design testing and verifying, 465 
implementation, 462-464 
special cases, 455-457 
special conditions, 457-459 
control structures, 200-201 
condition codes, 200—205 
conditional branches, 209-213 
conditional move instructions, 
214-220 
jumps, 205-209 
loops. See loops 
switch statements, 232-238 
control transfer, 241—245, 722 
controllers 
disk, 595, 595-596 
I/O devices, 9
memory, 583, 584 
conventional DRAMs, 582-584 
conversions
binary
with hexadecimal, 36-37
signed and unsigned, 70-76
to two’s complement, 64, 72-73,
97
to unsigned, 62-63
floating point, 125, 296-301 
lowercase, 509-511 
number systems, 36-39 
convert active socket to listening 
socket function, 935 
convert application-to-network 
function, 926 
convert double precision to integer 
instruction, 297 


convert double precision to quad-word
integer instruction, 297 
convert double to single precision 
instruction, 299 
convert host and service names 
function, 937, 937-940
convert host-to-network long function, 
925 
convert host-to-network short 
function, 925 
convert integer to double precision 
instruction, 297 
convert integer to single precision 
instruction, 297 
convert network-to-application 
function, 926 
convert network-to-host long function, 
925 
convert network-to-host short 
function, 925 
convert packed single to packed 
double precision instruction, 298 
convert quad-word integer to double 
precision instruction, 297 
convert quad-word integer to single 
precision instruction, 297 
convert quad word to oct word 
instruction, 198 
convert single precision to integer 
instruction, 297 
convert single precision to quad-word 
integer instruction, 297 
convert single to double precision
instruction, 298 
convert socket address to host and 
service names function, 940, 
940-942 
copy_elements function, 100 
copy file descriptor function, 909 
copy_from_kernel function, 86-87 
copy-on-write technique, 835, 835-836 
copying 
bytes in memory, 133 
descriptor tables, 909 
text files, 900 
Core 2 microprocessors, 168, 588 
Core i7 microprocessors, 25 
abstract operation model, 525-531
address translation, 826—828 
caches, 631 
Haswell, 507 
memory mountain, 641 
Nehalem, 168 
page table entries, 826-828 
QuickPath interconnect, 588 


virtual memory, 825-828 
core memory, 757 
cores in multi-core processors, 168, 
605, 972
correct signal handling, 770-774
counting semaphores, 1003
CPE (cycles per element) metric, 502,
504, 507-508
cpfile [CS:APP] text file copy, 900 
CPI (cycles per instruction) 
five-stage pipelines, 471 
in performance analysis, 464—468 
CPUS. See central processing units 
(CPUs) 
cqto [x86-64] convert quad word to 
oct word, 198, 199
CR (carriage return) characters, 892
CR3 register, 826
Cray 1 supercomputer, 353 
create/change environment variable 
function, 752 
create child process function, 740,
741-743 
create thread function, 988 
critical path analysis, 498 
critical paths, 525, 529 
critical sections in progress graphs, 
1000
CS:APP
header files, 746
wrapper functions, 738, 1041
csapp.c [CS:APP] CS:APP wrapper
functions, 738, 1041
csapp.h [CS:APP] CS:APP header 
file, 738, 746, 1041 
csh [Unix] Unix shell program, 753 
CT (cache tags), 823 
ctest script, 465 
ctime function, 1024
ctime_ts [CS:APP] thread-safe
non-reentrant wrapper for ctime,
1022
Ctrl+C key 
nonlocal jumps, 785 
signals, 758, 761, 795 
Ctrl+Z key, 761, 795 
current working directory, 892 
cvtsd2ss [x86-64] convert double to
single precision, 299
cvtss2sd [x86-64] convert single to
double precision, 298
cycles per element (CPE) metric, 502, 
504, 507-508 
cycles per instruction (CPI) 
five-stage pipelines, 471
in performance analysis, 464-468
cylinders 
disk, 597 
spare, 596 
%cx [x86-64] low order 16 bits of
register %rcx, 180


d-caches (data caches), 520, 631 
data
conditional transfers, 214-220
forwarding, 436-439, 437
sizes, 39-42
data alignment, 273, 273-276 
data caches (d-caches), 520, 631 
data dependencies in pipelining, 419, 
429-431 
data-flow graphs, 525-530 
data formats in machine-level 
programming, 177-179 
data hazards, 429
avoiding, 441-444 
classes, 435 
forwarding for, 436-439 
load/use, 439-441 
stalling, 433-436 
Y86-64 pipelining, 429-433 
data memory in SEQ timing, 401 
data movement instructions, 182-189 
data references 
locality, 606—607 
PIC, 704—705 
.data section, 674
data segments, 696
data structures, 265
data alignment, 273-276
structures, 265-269 
unions, 269-273 
data transfer, procedures, 245-248 
data types. See types 
database transactions, 919
datagrams, 924
DDD debugger with graphical user 
interface, 279 
DDR SDRAM (double data-rate 
synchronous DRAM), 586 
deadlocks, 1027, 1027-1030
deallocate heap storage function, 847
.debug section, 675
debugging, 279-280 
DEC [instruction class] decrement, 192 
decimal notation, 32 
decimal system conversions, 37-39 
declarations 
arrays, 255—256, 263 
pointers, 41
public and private, 677 
structures, 265-269 
unions, 269-273 
decode stage
instruction processing, 385, 387-397 
PIPE processor, 449-453 
sequential processing, 400 
Y86-64 implementation, 406—408 
Y86-64 pipelining, 423 
decoding instructions, 519
decrement instruction, 192, 194
deep copies, 1024
deep pipelining, 418-419
default actions with signal, 762 
default behavior for child processes, 
744 
default function code, 404 
deferred coalescing, 850 
#define [C] preprocessor directive
delete command, 280 
delete environment variable function, 
752 
DELETE method in HTTP, 951 
delete signal from signal set 
instruction, 765 
delivering signals, 758 
delivery mechanisms for protocols,
922
demand paging, 810
demand-zero pages, 833
demangling process (C++ and Java), 
680, 680 
denormalized floating-point value, 
114, 114-116 
dependencies 
control in pipelining systems, 419, 
429 
data in pipelining systems, 419, 
429-431
reassociation transformations, 542 
write/read, 557-559 
dereferencing pointers, 48, 188, 257, 
277, 870-871 
descriptor sets, 947, 978 
descriptor tables, 907, 909 
descriptors, 891
connected and listening, 936,
936-937
socket, 934
destination hosts, 922 
detach thread function, 990 
detached threads, 989 
detaching threads, 989-990 
%di [x86-64] low order 16 bits of
register %rdi, 180


diagrams 
hardware, 398 
pipeline, 473 
Digital Equipment Corporation, 56 
Dijkstra, Edsger, 1001-1002 
%dil [x86-64] low order 8 of register
%rdi, 180
DIMM (dual inline memory module), 
584 
direct jumps, 206 
direct-mapped caches, 617 
conflict misses, 622-624 
example, 619-621 
line matching, 618 
line replacement, 619
set selection, 618 
word selection, 619 
direct memory access (DMA), 17, 598 
directives, assembler, 176, 366 
directories 
description, 891, 891—892 
reading contents, 905-906 
directory streams, 905 
dirty bits 
in cache, 630 
Core i7, 827 
dirty pages, 827 
disas command, 280 
disassemblers, 44, 69, 173, 173-174 
disks, 589 
accessing, 597-600 
anatomy, 600 
backups, 611 
capacity, 591, 591-592 
connecting, 596—597 
controllers, 595, 595-596 
geometry, 590-591 
logical blocks, 595-596 
operation, 592-595 
trends, 602
distributing software, 701
division
floating-point, 302
instructions, 198-200
Linux/x86-64 system errors, 729
by powers of 2, 103-107
divq [x86-64] unsigned divide, 198,
200
%dl [x86-64] low order 8 of register
%rdx, 180
dlclose [Unix] close shared library,
702
dlerror [Unix] report shared library
error, 702
DLL (dynamic link library), 699
dlopen [Unix] open shared library, 701


dlsym [Unix] get address of shared 
library symbol, 702 

DMA (direct memory access), 17, 598 

DMA transfer, 598 

DNS (domain name system), 928 

do [C] variant of while loop, 220—223 


do-while statement, 220
doit [CS:APP] Tiny helper function, 
956, 958, 958—959 


dollar signs ($) for immediate 
operands, 181 

domain names, 925, 927-929 

domain name system (DNS), 928 

dotprod [CS:APP] vector dot product,
622
dots (.) in dotted-decimal notation,
926


dotted-decimal notation, 926, 926 
double [C] double-precision floating 
point, 124, 125
double [C] integer data type, 41
double data-rate synchronous DRAM
(DDR SDRAM), 586 
double floating-point declaration, 178 
double-precision addition instruction, 
302 
double-precision division instruction, 
302 
double-precision maximum 
instruction, 302 
double-precision minimum 
instruction, 302
double-precision multiplication
instruction, 302 
double-precision representation 
C, 41, 124-126 
IEEE, 113, 113 
machine-level data, 178
double-precision square root
instruction, 302
double-precision subtraction
instruction, 302 
double word to quad word instruction, 
199 
double words, 177 
DRAM. See dynamic RAM (DRAM) 
DRAM arrays, 582 
DRAM cells, 582, 583
drivers, compiler, 4, 671-672
dual inline memory module (DIMM),
584 
dup2 [Unix] copy file descriptor, 909 
duplicate symbol names, 680-684 
dynamic code, 290 
dynamic content, 701, 953-954 
dynamic link libraries (DLLs), 699
dynamic linkers, 699
dynamic linking, 699, 699-701 
dynamic memory allocation 
allocated block placement, 849 
allocator design, 854-856 
allocator requirements and goals, 
844-845 
coalescing free blocks, 850-851
coalescing with boundary tags,
851-854
explicit free lists, 862-863
fragmentation, 846
heap memory requests, 850
implementation issues, 846-847
implicit free lists, 847-849
malloc and free functions, 840-843
overview, 839-840
purpose, 843-844
segregated free lists, 863-865
splitting free blocks, 849-850
dynamic memory allocators, 839-840
dynamic RAM (DRAM), 9, 582
caches, 806, 808, 808-809
conventional, 582-584
enhanced, 585-586
historical popularity, 586
modules, 584, 585
vs. SRAM, 582
trends, 602-603
dynamic Web content, 949
%dx [x86-64] low order 16 bits of
register %rdx, 180
E-way set associative caches, 624—625 
%eax [x86-64] low order 32 bits of
register %rax, 180
%ebp [x86-64] low order 32 bits of
register %rbp, 180
%ebx [x86-64] low order 32 bits of
register %rbx, 180
ECF. See exceptional control flow
(ECF)
ECHILD return code, 746-747
echo [CS:APP] read and echo input 
lines, 947 
echo function, 281—282, 287 
echo_cnt [CS:APP] counting version
of echo, 1012
echoclient.c [CS:APP] echo client,
944—945 
echoserveri.c [CS:APP] iterative 
echo server, 936-937, 947 
echoservert.c [CS:APP] concurrent
echo server based on threads, 
991 


echoservert_pre.c [CS:APP]
prethreaded concurrent echo
server, 1011

%ecx [x86-64] low order 32 bits of
register %rcx, 180

%edi [x86-64] low order 32 bits of
register %rdi, 180

EDO DRAM (extended data out
DRAM), 586

%edx [x86-64] low order 32 bits of
register %rdx, 180

EEPROMs (electrically erasable
programmable ROMs), 587

effective addresses, 181, 690

effective cycle time, 602

efficiency of parallel programs, 1019,
1019

EINTR return code, 746

electrically erasable programmable
ROMs (EEPROMs), 587

ELF. See executable and linkable
format (ELF)

EM64T processors, 168

embedded processors, 363


encapsulation, 922 
encodings in machine-level 
programming, 169-170 
code examples, 172-175
code overview, 170-171
formatting, 175-177
Y86-64 instructions, 358-360
end-of-file (EOF) condition, 897, 948
end of line (EOL) indicators, 892
entry points, 696, 697-698
environment variables lists, 751-752
EOF (end-of-file) condition, 897, 948
EOL (end of line) indicators, 892
ephemeral ports, 930
epilogue blocks, 855
EPIPE error return code, 964
erasable programmable ROMs
(EPROMs), 587 
errno [Unix] Unix error variable, 
1042 
error-correcting codes for memory, 
582 
error handling 
system calls, 737-738 
Unix systems, 1042-1043 
wrappers, 738, 1041, 1043-1045
error-reporting functions, 737
errors
child processes, 745-746
link-time, 7
off-by-one, 872
race, 776, 776-778
reporting, 1043
synchronization, 995
%esi [x86-64] low order 32 bits of
register %rsi, 180
%esp [x86-64] low order 32 bits of
stack pointer register %rsp, 180
establish connection with server 
functions, 934, 934—935, 942-944 
establish listening socket function, 
944, 944 
etest script, 465
Ethernet segments, 920, 920
Ethernet technology, 920
EUs (execution units), 518, 520
eval [CS:APP] shell helper routine,
754, 755
event-driven programs, 980
based on I/O multiplexing, 980-985
based on threads, 1013
events, 723
scheduling, 763
state machines, 980
evicting blocks, 612
exabytes, 39
excepting instructions, 445
exception handlers, 724, 724 
exception handling 
in instruction processing, 385 
Y86-64, 363-364, 444—447 
exception numbers, 725
exception table base registers, 725 
exception tables, 725, 725 
exceptional control flow (ECF), 722 
exceptions, 723-731 
importance, 722-723
nonlocal jumps, 781-786
process control. See processes
signals. See signals
summary, 787
system call error handling, 737-738
exceptions, 723
anatomy, 723-724
asynchronous, 726
classes, 726-728
data alignment, 276
handling, 724-726
Linux/x86-64 systems, 729-731
status code for, 404
synchronous, 727
Y86, 356 
exclamation points ! for NOT 
operation, 373 
EXCLUSIVE-OR Boolean operation, 51 
exclusive-or instruction 
x86-64, 192 
Y86-64, 356
EXCLUSIVE-OR operation in execute
stage, 408
exclusive-or packed double precision
instruction, 305
exclusive-or packed single precision 
instruction, 305 
executable and linkable format (ELF), 
673 
executable object files, 695-696 
header tables, 674, 696 
headers, 674-675 
relocation, 690
symbol tables, 675-679
executable code, 170
executable object files, 4
creating, 672 
description, 672 
fully linked, 696 
loading, 697—698 
running, 7-8 
executable object programs, 4 
execute access, 289 
execute disable bit, 827 
execute stage 
instruction processing, 385, 387-397 
PIPE processor, 453—454 
sequential processing, 400 
sequential Y86-64 implementation,
408-409
Y86-64 pipelining, 423
execution
concurrent, 733
parallel, 734
speculative, 519, 519, 549-550
tracing, 387, 394—395, 403 
execution code regions, 289—290 
execution units (EUs), 518, 520 
execve [Unix] load program, 750 
arguments and environment 
variables, 750-752 
child processes, 699, 701
loading programs, 697
running programs, 753-756
virtual memory, 836-837
exit [C Stdlib] terminate process, 739
exit status, 739, 745
expanding bit representation, 76-80 
expansion slots, 597 
explicit allocator requirements and 
goals, 844-845 
explicit dynamic memory allocators, 
839-840 
explicit free lists, 862-863 
explicit thread termination, 988 
explicit waiting for signals, 778-781
explicitly reentrant functions, 1023 
exploit code, 284 


exponents in floating-point 
representation, 112

extend_heap [CS:APP] allocator: 
extend heap, 858 

extended data out DRAM (EDO 
DRAM), 586 

extended precision floating-point 
representation, 137, 137

external exceptions in pipelining, 444

external fragmentation, 846, 846 


fall through in switch statements, 233 
false fragmentation, 850 
fast page mode DRAM (FPM 
DRAM), 585
fault exception class, 726 
faulting instructions, 727 
faults, 728 
Linux/x86-64 systems, 729, 832-833 
Y86-64 pipelining caches, 470 
FD_CLR [Unix] clear bit in descriptor 
set, 977, 978 
FD_ISSET [Unix] bit turned on in
descriptor set, 977, 978, 980
FD_SET [Unix] set bit in descriptor set, 
977, 978 
FD_ZERO [Unix] clear descriptor set, 
977, 978 
feedback in pipelining, 419-421, 425 
feedback paths, 396, 419 
fetch file metadata function, 903 
fetch stage
instruction processing, 384, 387—397 
PIPE processor, 447-449 
SEQ, 404—406 
sequential processing, 400 
Y86-64 pipelining, 423 
fetches, locality, 607-608 
fgets function, 282 
Fibonacci (Pisano), 32 
field-programmable gate arrays 
(FPGAs), 467
FIFOs, 977
file descriptors, 897
file position, 697
file tables, 736, 906
FILE type, 911
filenames, 892
files, 19
as abstraction, 27
anonymous, 833 
binary, 3 
metadata, 903—904 
object. See object files 
register, 10, 171, 358-359, 382-383,
401, 521 
regular, 833 


sharing, 906—908 
system-level I/O. See system-level
I/O
types, 891-893
Unix, 890, 890-891
FINGER command, 284
fingerd daemon, 284
finish command, 280 
firmware, 587 
first-fit block placement policy, 849, 
849 
first-level domain names, 927 
first readers-writers problem, 1008 
fits, segregated, 863, 864-865
five-stage pipelines, 471
fixed-size arithmetic, 85
fixed-size arrays, 260-262
fixed-size integer types, 41, 67
flash memory, 587
flash translation layers, 600-601
flat addressing, 167
float [C] single-precision floating
point, 124
float floating-point declaration, 178
floating-point code
architecture, 293, 293-296 
arithmetic operations, 302-304 
bitwise operations, 305-306 
comparison operations, 306-309 
constants, 304-305 
movement and conversion 
operations, 296-301
observations, 309 
in procedures, 301—302 
floating-point representation and 
programs, 108-109 
arithmetic, 33 
C, 124-126 
denormalized values, 114, 114-116 
encodings, 32 
extended precision, 137, 137
fractional binary numbers, 109-112
IEEE, 112-114
normalized value, 113-114
operations, 122-124
overflow, 127
pi, 140
rounding, 120, 120-122
special values, 115
support, 40
x87 processors, 167
flows
concurrent, 733, 733-734 
control, 722 
logical, 732, 732-733
parallel, 734
synchronizing, 776-778
flushed instructions, 522
FNONE [Y86-64] default function code,
404
footers of blocks, 857
for [C] general loop statement,
228-232
guarded-do translation, 225
jump-to-middle translation, 223
forbidden regions, 1003
foreground processes, 753
fork [Unix] create child process, 740
child processes, 701
example, 741-743
running programs, 753-756
virtual memory, 836
fork.c [CS:APP] fork example, 741
formal verification in pipelining, 466
format strings, 47
formats for machine-level data,
177-179
formatted disk capacity, 596
formatted printing, 47
formatting
disks, 596
machine-level code, 175-177
forwarding
for data hazards, 436-439
load, 477
forwarding priority, 451-452
FPGAs (field-programmable gate
arrays), 467
FPM DRAM (fast page mode
DRAM), 585
fprintf [C Stdlib] function, 47
fractional binary numbers, 109-112
fractional floating-point representation,
112-120, 137
fragmentation, 846
dynamic memory allocation, 846
false, 850
frame pointers, 290
frames
Ethernet, 920
stack, 240, 240-241, 276, 290-293
free blocks, 839
coalescing, 850-851
splitting, 849-850
free bounded buffer function, 1007
free [C Stdlib] deallocate heap
storage, 841, 841-843
interpositioning libraries, 708
wrappers for, 711
free heap block function, 860
free heap blocks, referencing data in,
874-875
free lists
creating, 857-859
dynamic memory allocation, 847-849
explicit, 862-863
implicit, 848
manipulating, 856-857
segregated, 863-865
free software, 6
free up getaddrinfo resources function,
937
freeaddrinfo [Unix] free up
getaddrinfo resources, 937,
938
FreeBSD open-source operating
system, 86-87
freeing blocks, 860
Freescale
processor family, 352
RISC design, 361
front side bus (FSB), 588
fstat [Unix] fetch file metadata, 903
full duplex connections, 929
full duplex streams, 912
fully associative caches, 626
line matching and word selection,
627-628
set selection, 627
fully linked executable object files,
696
fully pipelined functional units, 523
function calls
performance strategies, 561
PIC, 705-707
function part in Y86-64 instruction
specifier, 358
functional units, 520-521, 523-524
functions
pointers to, 278
reentrant, 766, 1023
static libraries, 684-688
system-level, 730
thread-safe and thread-unsafe,
1020, 1020-1022
wrapper, 711
in Y86 instructions, 359


gai_error [CS:APP] reports GAI-style
errors, 1043
gai_strerror [Unix] print
getaddrinfo error message,
938
GAI-style error handling, 1042,
1042-1043
gaps between disk sectors, 590, 596
garbage, 866
garbage collection, 840, 866
garbage collectors, 840, 866
basics, 866-867
conservative, 867, 869-870
Mark&Sweep, 867-870
overview, 865-866
gates, logic, 373
GCC (GNU compiler collection)
compiler
code formatting, 175-176
inline assembly, 178
options, 35
working with, 168-169
GDB GNU debugger, 173, 279, 279-280
general protection faults, 729
general-purpose registers, 179, 179-180
geometry of disks, 590-591
get address of shared library symbol
function, 702
“get from” operator (C++), 890
GET method in HTTP, 951
get parent process ID function, 739
get process group ID function, 759
get process ID function, 739
get thread ID function, 988
getaddrinfo [Unix] convert host and
service names, 937, 937-940
getenv [C Stdlib] read environment
variable, 751
gethostbyaddr [Unix] get DNS host
entry, 1024
gethostbyname [Unix] get DNS host
entry, 1024
getnameinfo [Unix] convert socket
address to host and service
names, 940, 940-942
getpeername function [C Stdlib]
security vulnerability, 86-87
getpgrp [Unix] get process group ID,
759
getpid [Unix] get process ID, 739
getppid [Unix] get parent process
ID, 739
getrusage [Unix] function, 811
gets function, 279, 281-282
GHz (gigahertz), 502
giga-instructions per second (GIPS),
413
gigabytes, 592
gigahertz (GHz), 502
GIPS (giga-instructions per second),
413
global IP Internet. See Internet
Global Offset Table (GOT), 705,
705-707
global symbols, 675
global variable mapping, 994-995
GNU compiler collection. See GCC
(GNU compiler collection)
compiler

GNU project, 6

GOT (global offset table), 705,
705-707

goto [C] control transfer statement,
210, 233

goto code, 210

GPROF Unix profiler, 562, 562-563

gradual underflow, 115

granularity of concurrency, 985

graphic user interfaces for debuggers,
279

graphics adapters, 596 

graphs 

data-flow, 525—530 
process, 741, 742 
progress. See progress graphs 
reachability, 866 
greater than signs (>)
dereferencing operation, 266
“get from” operator, 890
right hoinkies, 909
groups
abelian, 89
process, 759
guard values, 286
guarded-do translation, 225 


.h header files, 686 
half-precision floating-point 
representation, 137, 137 
halt [Y86-64] halt instruction
execution, 357
code for, 404-405
exceptions, 364, 444-447
in pipelining, 462
handlers
exception, 724, 724
interrupt, 726
signal, 758, 763
handling signals
blocking and unblocking, 764—765 
portable, 774-775 
hardware caches. See caches and cache
memory
hardware control language (HCL),
372
Boolean expressions, 374-375
integer expressions, 376-380
logic gates, 373
hardware description languages
(HDLs), 373, 467
hardware exceptions, 724 
hardware interrupts, 726 


hardware management, 14-15 
hardware organization, 8 
buses, 8 
I/O devices, 9 
main memory, 9
processors, 9-10
hardware registers, 381-384
hardware structure for Y86-64,
396—400 
hardware units, 396—398, 401 
hash tables, 567—568 
Haswell microarchitecture, 825 
Haswell microprocessors, 168, 215, 
294, 507, 521, 523 
hazards in pipelining, 354, 429 
avoiding, 441—444 
classes, 435 
forwarding for, 436-439
load/use, 439-441
overview, 429-433
stalling for, 433-436
HCL (hardware control language),
372
Boolean expressions, 374-375
integer expressions, 376-380 
logic gates, 373 
HDLs (hardware description 
languages), 373, 467 
head crashes, 593 
HEAD method in HTTP, 951
header files
static libraries, 687
system, 746
header tables in ELF, 674, 696
headers
blocks, 847
Ethernet, 920
request, 951
response, 952
heap, 18, 18-19, 839
dynamic memory allocation, 839-840
Linux systems, 697
referencing data in, 874-875
requests, 850
hello [CS:APP] C hello program, 2,
10-12 
help command, 280 
helper functions, sockets interface, 
942-944 
Hennessy, John, 361, 471 
heterogeneous data structures, 265 
data alignment, 273-276
structures, 265-269
unions, 269-273
hexadecimal (hex) notation, 36, 36-39
hierarchies
domain name, 927
storage devices, 14, 14, 609-614
high-level design performance
strategies, 561
hit rates, 631 
hit time, 631 
hits 
cache, 612, 631 
write, 630 
hlt [x86-64] halt instruction
execution, 357
HLT [Y86-64] status code indicating
halt instruction, 364
hoinkies, 909, 910
holding mutexes, 1003
Horner, William, 530
Horner’s method, 530
host bus adapters, 597
host bus interfaces, 597
host entries, 928
host information program command,
926
HOSTNAME command, 926
hosts
client-server model, 979
network, 922
number of, 930
Sockets interface, 937-942
htest script, 465
HTML (hypertext markup language), 
948, 948-949
htonl [Unix] convert host-to-network
long, 925
htons [Unix] convert host-to-network
short, 925
HTTP. See hypertext transfer protocol
(HTTP)
hubs, 920
hyperlinks, 948
hypertext markup language (HTML),
948, 948-949
hypertext transfer protocol (HTTP),
948
dynamic content, 953-954
methods, 951-952
requests, 951, 951-952
responses, 952, 952-953
transactions, 950-951
hyperthreading, 24, 168
HyperTransport interconnect, 588


i-caches (instruction caches), 518, 631
.i source files, 671
i386 microprocessor, 167
i486 microprocessor, 167
IA32 (Intel Architecture 32-bit)
microprocessors, 45, 168
machine language, 165-166
registers, 179-180 
iaddq [Y86-64] immediate add, 369 
IBM 
Freescale microprocessors, 352, 361 
out-of-order processing, 522 
RISC design, 361-363 
ICALL [Y86-64] instruction code for 
call instruction, 404 
ICANN (Internet Corporation for 
Assigned Names and Numbers), 
927 
icode (instruction code), 384, 405 
ICUs (instruction control units), 518 
identifiers, register, 358 
idivl [x86-64] signed divide, 199 
idivq [x86-64] signed divide, 198 
IDs (identifiers) 
processes, 739-740 
register, 358-359 
IEEE. See Institute for Electrical and 
Electronics Engineers (IEEE) 
if [C] conditional statement, 211-213
ifun (instruction function), 384, 405
IHALT [Y86-64] instruction code for
halt instruction, 404
IIRMOVQ [Y86-64] instruction code for
irmovq instruction, 404
ijk matrix multiplication, 644—646, 645 
IJXX [Y86-64] instruction code for 
jump instructions, 404 
ikj matrix multiplication, 644-646, 645 
illegal instruction exceptions, 404 
imem_error signal, 405 
immediate add instruction, 369 
immediate coalescing, 850 
immediate offset, 181
immediate operands, 181
immediate to register move
instruction, 356 
implicit dynamic memory allocators, 
840 
implicit free lists, 847-849, 848 
implicit thread termination, 988 
implicitly reentrant functions, 1023 
implied leading 1 representation, 114 
IMRMOVQ [Y86-64] instruction code for 
mrmovq instruction, 404
IMUL [instruction class] multiply, 192
imulq [x86-64] signed multiply, 198,
198
in [HCL] set membership test, 381
in_addr [Unix] IP address structure, 
925 





INC [instruction class] increment, 192

include files, 686

#include [C] preprocessor directive,
170 

incq instruction, 194 

increment instruction, 192, 194 

indefinite integer values, 125 

index.html file, 950

index registers, 181 

indexes for direct-mapped caches, 
622-624 

indirect jumps, 206, 234 

inefficiencies in loops, 508-512 

inet_ntoa [Unix] convert network- 
to-application, 1024 

inet_ntop [Unix] convert network- 
to-application, 926 

inet_pton [Unix] convert
application-to-network, 926

infinity 

constants, 124 
representation, 114-115 

info frame command, 280

info registers command, 280 

information, 2-4 

information access with x86-64 
registers, 179-180 
data movement, 182-189 
operand specifiers, 180-182
information storage, 34
addressing and byte ordering, 42-49
bit-level operations, 54-56
Boolean algebra, 50-53 
code, 49-50 


data sizes, 39-42 

disks. See disks 

floating point. See floating-point 
representation and programs
hexadecimal, 36-39
integers. See integers
locality. See locality
logical operations, 56-57
memory. See memory
segregated, 863 
shift operations, 57—59 
strings, 49 
summary, 648 
init function, 743 
init_pool function, 981, 983
initial state in progress graphs, 999 
initialize nonlocal handler jump 
function, 783 
initialize nonlocal jump functions, 783
initialize read buffer function, 898,
900
initialize semaphore function, 1002
initialize thread function, 990
initializing threads, 990 
inline assembly, 178
inline substitution, 501 
inlining, 501 
INOP [Y86-64] instruction code for 
nop instruction, 404 
input events, 980 
input/output. See I/O (input/output) 
insert item in bounded buffer function, 
1007 
install portable handler function, 775 
installing signal handlers, 763
Institute for Electrical and Electronics 
Engineers (IEEE) 
description, 109 
floating-point representation and
programs, 112-114
denormalized, 114
normalized, 113-114
special values, 115
Standard 754, 109
standards, 109
Posix standards, 16
instr_valid signal, 405-406
instruction caches (i-caches), 518, 631 
instruction code (icode), 384, 405 
instruction control units (ICUs), 518
instruction function (ifun), 384, 405 
instruction-level parallelism, 26, 497,
518, 562
instruction memory in SEQ timing, 
401 
instruction set architectures (ISAs), 
10, 27, 170, 352
instruction set simulators, 366 
instructions 
classes, 182 
decoding, 518 
excepting, 445
fetch locality, 607-608
issuing, 427-428
jump, 10, 205-209
load, 10
low-level. See machine-level
programming
move, 214-220, 550-553
operate, 10
pipelining, 468-469, 549
privileged, 735 


store, 10 

update, 9-10

Y86-64. See Y86-64 instruction set
architecture


instructions per cycle (IPC), 471 
int [C] integer data type, 40
int [HCL] integer signal, 376
int data types, integral, 61
INT_MAX constant, maximum signed
integer, 68
INT_MIN constant, minimum signed
integer, 68
int32_t [Unix] fixed-size, 47
integer arithmetic, 84, 192 
division by powers of 2, 103-107 
multiplication by constants, 101-103
overview, 107-108
two's complement addition, 90-95
two's complement multiplication,
97-101
two's complement negation, 95
unsigned addition, 84-90
integer bits in floating-point 
representation, 137 
integer expressions in HCL, 376-380
integer indefinite values, 125 
integer operation instruction, 404 
integer registers in x86-64, 179-180 
integers, 32, 59-60 
arithmetic operations. See integer 
arithmetic 
bit-level operations, 54-56 
bit representation expansion, 76-80
byte order, 43-44
data types, 60-62
shift operations, 57-59
signed and unsigned conversions, 
70-76 
signed vs. unsigned guidelines, 
83-84 
truncating, 81-82 
two’s complement representation, 
64-70 
unsigned encoding, 62-64 
integral data types, 60, 60-62 
integration of caches and VM, 817
Intel assembly-code format, 177, 294,
311
Intel Corporation, 165 
Intel microprocessors 
8086, 26, 167 
80286, 167 
Core 2, 168, 588 
Core i7. See Core i7 microprocessors
data alignment, 276
evolution, 167-168
floating-point representation, 137
Haswell, 168, 215, 294, 523
i386, 167
i486, 167


northbridge and southbridge
chipsets, 588
out-of-order processing, 522
Pentium, 167
Pentium II, 167
Pentium III, 167-168
Pentium 4, 168
Pentium 4E, 168
PentiumPro, 167, 522
Sandy Bridge, 168
x86-64. See x86-64 microprocessors
Y86-64. See Y86-64 instruction set 
architecture 
interconnected networks (internets), 
921, 921-922 
interfaces 
bus, 588 
host bus, 597 
interlocks, load, 441 
internal exceptions in pipelining, 444 
internal fragmentation, 846 
internal read function, 901 
International Standards Organization 
(ISO), 4, 35 
Internet, 927 
connections, 929-931 
domain names, 927-929
IP addresses, 925-927 
organization, 924-925 
origins, 931 
internet addresses, 922
Internet Corporation for Assigned 
Names and Numbers (ICANN), 
927 
Internet domain names, 925 
Internet Domain Survey, 930 
Internet hosts, number of, 930
Internet Protocol (IP), 924
Internet Software Consortium, 930
Internet worms, 284
internets (interconnected networks),
921, 921-922
interpositioning libraries, 707, 707-
708
compile-time, 708-709 
link-time, 708, 710 
run-time, 710-712 
interpretation of bit patterns, 32 
interprocess communication (IPC), 
977 
interrupt handlers, 726
interruptions, 764
interrupts, 726, 726-727
interval counting schemes, 564
INTN_MAX [C] maximum value of
N-bit signed data type, 67


INTN_MIN [C] minimum value of
N-bit signed data type, 67
intN_t [C] N-bit signed integer data 
type, 67 
<inttypes.h> fixed-size integer 
types, 198 
invalid address status code, 364 
invariants, semaphore, 1002 
I/O (input/output), 9, 890
memory-mapped, 598
ports, 598
redirection, 909, 909-910
system-level. See system-level I/O 
Unix, 19, 890, 890-891 
I/O bridges, 587 
I/O buses, 588, 596, 598 
I/O devices, 9 
addressing, 598 
connecting, 596-597
I/O multiplexing, 973
concurrent programming with,
978-985
event-driven servers based on,
980-985 
pros and cons, 985 
IOPL [Y86-64] instruction code for 
integer operation instruction,
404
IP (Internet Protocol), 924
IP address structure, 925, 926
IP addresses, 924, 925-927
IPC (instructions per cycle), 471
IPC (interprocess communication),
977
iPhone 5S, 353
IPOPQ [Y86-64] instruction code for
popq instruction, 404
IPUSHQ [Y86-64] instruction code for
pushq instruction, 404
IPv6, 925
IRET [Y86-64] instruction code for
ret instruction, 404
IRMMOVQ [Y86-64] instruction code for
rmmovq instruction, 404
irmovq [Y86-64] immediate to register
move, 356, 404 
IRRMOVQ [Y86-64] instruction code for 
rrmovq instruction, 404 
ISAs (instruction set architectures),
10, 27, 170, 352
ISO (International Standards
Organization), 4, 35
ISO C11 C standard, 35
ISO C90 C standard, 35
ISO C99 C standard, 35, 411 324, 
integral data types, 67 


static libraries, 684-688 
isPtr function, 869 
issue time for arithmetic operations, 
523
issuing instructions, 427-428
iterative servers, 946
iterative sorting routines, 567


ja [x86-64] jump if unsigned greater, 
206 
jae [x86-64] jump if unsigned greater 
or equal, 206 
Java language, 677
byte code, 310
linker symbols, 680
numeric ranges, 68
objects, 266-267
software exceptions, 723-724, 786
threads, 1030 
Java monitors, 1010 
Java Native Interface (JNI), 704 
jb [x86-64] jump if unsigned less, 206 
jbe [x86-64] jump if unsigned less or 
equal, 206 
je [Y86-64] jump when equal, 357, 
394 
jg [x86-64] jump if greater, 206, 357 
jge [x86-64] jump if greater or equal, 
206, 357
jik matrix multiplication, 644-646, 645
jki matrix multiplication, 644-646, 645
jl [x86-64] jump if less, 206, 357
jle [x86-64] jump if less or equal, 206,
357 


jmp [x86-64] jump unconditionally,

jna [x86-64] jump if not unsigned 
greater, 206 

jnae [x86-64] jump if not unsigned greater
or equal, 206 

jnb [x86-64] jump if not unsigned less, 
206 

jnbe [x86-64] jump if not unsigned 
less or equal, 206 

jne [x86-64] jump if not equal, 206, 
357 

jng [x86-64] jump if not greater, 206 

jnge [x86-64] jump if not greater or 
equal, 206 

JNI (Java Native Interface), 704 

jnl [x86-64] jump if not less, 206 

jnle [x86-64] jump if not less or equal, 
206 

jns [x86-64] jump if nonnegative, 206

jnz [x86-64] jump if not zero, 206 

jobs, 760 





joinable threads, 989 

jp [x86-64] jump when parity flag set, 
306

js [x86-64] jump if negative, 206 

jtest script, 465

jump if greater instruction, 206, 357 

jump if greater or equal instruction,
206, 357 

jump if less instruction, 206, 357 

jump if less or equal instruction, 206, 
357 

jump if negative instruction, 206 

jump if nonnegative instruction, 206

jump if not equal instruction, 206, 357

jump if not greater instruction, 206 

jump if not greater or equal 
instruction, 206 

jump if not less instruction, 206

jump if not less or equal instruction,
206 

jump if not unsigned greater 
instruction, 206 

jump if not unsigned less instruction,
206

jump if not unsigned less or equal
instruction, 206 

jump if not zero instruction, 206 

jump if unsigned greater instruction,
206
jump if unsigned greater or equal
instruction, 206


jump if unsigned less instruction, 206 
jump if unsigned less or equal 
instruction, 206
jump if zero instruction, 206
jump instructions, 10, 205-209, 404
direct, 206
indirect, 206, 234
instruction code for, 404
nonlocal, 723, 781, 781-786
targets, 206
jump tables, 233, 234-235, 725
jump-to-middle translation, 223
jump unconditionally instruction, 206,
206
jump when equal instruction, 357
jump when parity flag set instruction,
306
just-in-time compilation, 290, 310
jz [x86-64] jump if zero, 206 


k x 1 loop unrolling, 531

k x 1a loop unrolling, 544

k x k loop unrolling, 539-540
K&R (C book), 4 

Kahan, William, 109
Kahn, Robert, 931
kernel mode
exception handlers, 726
processes, 734-736, 735
system calls, 728
kernels, 77, 19, 698 
exception numbers, 725
virtual memory, 830-831
Kernighan, Brian, 2, 4, 16, 35, 278, 914
Kerrisk, Michael, 914
keyboard, signals from, 760-761 
kij matrix multiplication, 644-646, 645 
kill [Unix] send signal, 767 
kill command in GDB debugger, 280
kill.c [CS:APP] kill example, 761
kji matrix multiplication, 644-646, 645
Knuth, Donald, 849, 851 
ksh [Unix] Unix shell program, 753 


l suffix, 179 
L1 cache, 13, 615 
L2 cache, 13, 615
L3 cache, 615
labels for jump instructions, 205
LANs (local area networks), 920,
920-922
last-in, first-out discipline, 189
last-in first-out (LIFO) free list order, 
863 
latency 
arithmetic operations, 523, 524 
disks, 594 
instruction, 413 
load operations, 554-555
pipelining, 472
latency bounds, 518, 524
lazy binding, 706
LD Unix static linker, 672
LD-LINUX.SO linker, 699
LD_PRELOAD environment variable,
710-712 
LDD tool, 713 
LEA instruction, 102 
leaf procedures, 247 
leaks, memory, 875, 992 
leaq [x86-64] load effective address, 
191, 191-192, 277 
least-frequently-used (LFU) 
replacement policies, 626
least-recently-used (LRU) replacement
policies, 612, 626
least squares fit, 502, 504
leave [x86-64] prepare stack for
return instruction, 292
left hoinkies (<), 909
length of strings, 83
less than signs (<)
left hoinkies, 909
“put to” operator, 890
levels 
optimization, 498 
storage, 609-610
LF (line feed) characters, 892 
LFU (least-frequently-used) 
replacement policies, 626 
libc library, 911 
__libc_start_main, 698
libraries
in concurrent programming, 1024-
1025
header files, 83
interpositioning, 707, 707-712
shared, 19, 699, 699-701
standard I/O, 911
static, 684, 684-688 
LIFO (last-in first-out) free list order,
863
<limits.h> file for numeric limit
declarations, 67-68, 77
line feed (LF) characters, 892
line matching 
direct-mapped caches, 618 
fully associative caches, 626
set associative caches, 625-626
line replacement
direct-mapped caches, 619
set associative caches, 626
.line section, 675
linear address spaces, 804 
link-time errors, 7 
link-time interpositioning, 708, 710 
linkers and linking, 5, 764, 170 
compiler drivers, 671-672 
dynamic, 699, 699-701
library interpositioning, 707,
707-712
object files, 673, 673-674
executable, 695-698
loading, 697-698
relocatable, 674-675
tools for, 713
overview, 670-671
position-independent code, 704-
707
relocation, 689-695
shared libraries from applications,
701-703
static, 672
summary, 713-714
symbol resolution, 679-689
symbol tables, 675-679
virtual memory for, 811-812


linking phase, 6 
links in directories, 897 
Linux operating system, 20, 45 
code segments, 697-698 
dynamic linker interfaces, 702 
and ELF, 673 
exceptions, 729-731 
files, 891-893 
signals, 756 
static libraries, 685-686 
virtual memory, 830-833 
Lisp language, 85 
listen [Unix] convert active socket 
to listening socket, 935 
listening descriptors, 936-937 
listening sockets, 935 
little-endian ordering convention, 42, 
42-44 
load effective address instruction, 
191-193, 277
load forwarding in PIPE, 477
load instructions, 10
load interlocks, 441
load operations 
example, 588 
process, 519-520 
load penalty in CPI, 467 
load performance of memory, 554-555
load program function, 750
load-store architecture in CISC vs.
RISC, 362 
load time for code, 670 
load/use data hazards, 439, 439-441 
loaders, 672, 697 
loading 
concepts, 699 
executable object files, 697-698 
process, 697
programs, 750-752 
shared libraries from applications, 
701-703 
virtual memory for, 812 
local area networks (LANs), 920, 
920-922 
local automatic variables, 994 
local registers, 527
local static variables, 994, 994-995
local storage
registers, 251-253
stack, 248-251
local symbols, 676
locality, 13, 580, 604-605
blocking for, 647
caches, 643-647, 810
exploiting, 647 
forms, 604, 614 


instruction fetches, 607-608 
program data references, 606-607
summary, 608-609
localtime function, 1024
lock-and-copy technique, 1022, 1022
locking mutexes
lock ordering rule, 1029
for semaphores, 1003
logic design, 372 
combinational circuits, 374-380,
413
logic gates, 373, 373
memory and clocking, 381-384
set membership, 380-381 
logic gates, 373 
logic synthesis, 355, 373, 467 
logical blocks
disks, 595, 595-596
SSDs, 601
logical control flow, 732, 732-733
logical operations, 56-57, 191
discussion, 196-197
load effective address, 191-193 
shift, 58, 104, 192, 194-196 
special, 197-200 
unary and binary, 194 
long [C] integer data type, 40-41, 
long double [C] extended-precision 
floating point, 125, 137
long double floating-point 
declaration, 178 
long words in machine-level data, 179 
longjmp [C Stdlib] nonlocal jump, 
723, 783, 783 
loop registers, 527 
loop unrolling, 502, 504, 537 
Core i7, 572 
k x 1, 531
k x 1a, 544
k x k, 539-540
overview, 531-535
with reassociation transformations,
541-543
loopback addresses, 928
loops, 220 
do-while, 220-223 
for, 228-232 
inefficiencies, 508-512 
reverse engineering, 222 
segments, 526-527
for spatial locality, 643-647
while, 223-228
low-level instructions. See machine-
level programming
low-level optimizations, 562
lowercase conversions, 509-511

LRU (least-recently-used) replacement
policies, 612, 626

ls command, 892

lseek [Unix] function, 896-897 

lvalue (C) assignable value for 
pointers, 277 


Mac OS X (Apple Macintosh) 
operating system, 27 
machine checks, 729
machine code, 164
machine-level programming
arithmetic. See arithmetic
arrays. See arrays 
buffer overflow. See buffer overflow 
control. See control structures 
data-flow graphs from, 525-529 
data formats, 177-179 
data movement instructions, 182- 
189 
encodings, 169-177 
floating point. See floating-point 
code
GDB debugger, 279-280
heterogeneous data structures. See
heterogeneous data structures
historical perspective, 166-169
information access, 179-180
instructions, 4
operand specifiers, 180-182
overview, 164-166
pointer principles, 278
procedures. See procedures
x86-64. See x86-64 microprocessors
macros for storage allocators, 856-
857
main memory, 9 
accessing, 587-589 
memory modules, 584 
main threads, 986
malloc [C Stdlib] allocate heap
storage, 35, 324, 697, 839-840,
840
alignment with, 276
declaration, 134-135
dynamic memory allocation, 840-
843
interpositioning libraries, 708
wrappers for, 711
man ascii command, 48
mandatory alignment, 276
mangling process (C++ and Java), 680
many-core processors, 471 
map disk-object into memory function, 
837 





mapping 
memory. See memory mapping 
variables, 994-995 
mark phase in Mark&Sweep, 867
Mark&Sweep algorithm, 866
Mark&Sweep garbage collectors, 867,
867-870
masking operations, 55
matrices
adjacency, 660
multiplying, 643-647
maximum floating-point instructions,
302
maximum two's complement number,
66
maximum unsigned number function,
63
maximum values, constants for, 68 
McCarthy, John, 866 
McIlroy, Doug, 16 
media instructions, 294 
mem_init [CS:APP] heap model, 855
mem_sbrk [CS:APP] sbrk emulator,
855
membership, set, 380-381
memcpy [Unix] copy bytes from one
region of memory to another,
133
memory, 580
accessing, 587-589
aliasing, 499, 500


associative, 625


caches. See caches and cache 
memory 

copying bytes in, 133 

data alignment in, 273-276 

data hazards, 435 


design, 384 

dynamic. See dynamic memory 
allocation 

hazards, 435

hierarchy, 14, 14, 609-614

leaks, 876, 992 


load performance, 554-555 

in logic design, 361-364

machine-language procedures, 239 

machine-level programming, 170 

main, 9, 584, 587-589

mapping. See memory mapping 

nonvolatile, 587 

performance, 553-561 

pipelining, 469-470 

protecting, 289, 812-813 

RAM. See random access memory
(RAM) 

ROM, 587
threads, 993-994
trends, 602-604 
virtual. See virtual memory (VM) 
Y86, 356 
memory buses, 587 
memory controllers, 583, 584 
memory management units (MMUs),
804, 807
memory-mapped I/O, 598
memory mapping, 812
areas, 833, 833
execve function, 836-837
fork function, 836
in loading, 699
objects, 833-836
user-level, 837-839
memory mountains, 639
Core i7 microprocessors, 641
overview, 639-643
memory references 
operands, 181 
out of bounds. See buffer overflow 
in performance, 514-517 
memory stage 
instruction processing, 385, 387-397 
PIPE processor, 454-455
sequential processing, 400
sequential Y86-64 implementation, 
409-411 
Y86-64 pipelining, 423 
memory system, 580 
memory utilization, 845, 845 
memset function, declaration, 134-135 
metadata, 903, 903-904 
metastable states, 581 
methods 
hypertext transfer protocol, 951- 
952 
objects, 267 
micro-operations, 519 
microarchitecture, 10, 517
microprocessors. See central 
processing units (CPUs) 
Microsoft Windows operating system, 
45 
MIME (multipurpose internet mail 
extensions) types, 949 
minimum block size, 848
minimum floating-point instructions,
302
minimum two’s complement number,
66
minimum values
constants, 68
two’s complement representation, 
66
mispredicted branches
handling, 443-444
performance penalties, 467, 520,
549-553
miss rates, 632
misses, caches, 470, 612
kinds, 612-613 
penalties, 632, 806 
rates, 631 
mkdir command, 892
mm_coalesce [CS:APP] allocator:
boundary tag coalescing, 860
mm_free [CS:APP] allocator: free
heap block, 860
mm-ijk [CS:APP] matrix multiply ijk,
645
mm-ikj [CS:APP] matrix multiply ikj,
645
mm_init [CS:APP] allocator: initialize
heap, 858 
mm-jik [CS:APP] matrix multiply jik, 
645 
mm-jki [CS:APP] matrix multiply jki, 
645 
mm-kij [CS:APP] matrix multiply kij, 
645 
mm-kji [CS:APP] matrix multiply kji, 
645 
mm_malloc [CS:APP] allocator:
allocate heap block, 860, 861
mmap [Unix] map disk object into
memory, 837, 837-839
MMUs (memory management units),
804, 807 
MMX media instructions, 167, 294 
Mockapetris, Paul, 931 
mode bits, 735 
modern processor performance, 
518-531 
modes 
kernel, 726, 728 
processes, 734-736, 735
user, 726, 728 
modified sequential processor 
implementation, 421-422 
modular arithmetic, 85-86, 89
modules 
DRAM, 584, 585 
object, 673 
monitors, Java, 1010 
monotonicity assumption, 846 
monotonicity property, 124 
Moore, Gordon, 169 
Moore's Law, 169, 169
MOSAIC browser, 949 
motherboards, 9 


Motorola RISC processors, 363 

MOV [instruction class] move data, 182,
182-183

movabsq [x86-64] move absolute quad
word, 183, 183

movb [x86-64] move byte, 183

move absolute quad word instruction, 
183, 183 

move aligned, packed double 
precision instruction, 296 

move aligned, packed single precision 
instruction, 296

move and sign-extend instruction, 184,
185 

move byte instruction, 183 

move data instructions, 182-189 

move double precision instruction, 
296 

move double word instruction, 183 

move if even parity instruction, 324 

move if greater instruction, 217, 357

move if greater or equal instruction,
217, 357

move if less instruction, 217, 357 

move if less or equal instruction, 217, 
357

move if negative instruction, 217 

move if nonnegative instruction, 217 

move if not equal instruction, 217, 
357 

move if not greater instruction, 217 

move if not greater or equal 
instruction, 217 

move if not less instruction, 217 

move if not less or equal instruction, 
217

move if not unsigned greater 
instruction, 217 

move if not unsigned less instruction, 
217 

move if not unsigned less or equal 
instruction, 217 

move if not zero instruction, 217

move if unsigned greater instruction, 
217 

move if unsigned greater or equal
instruction, 217

move if unsigned less instruction, 217 

move if unsigned less or equal
instruction, 217 

move if zero instruction, 217 

move instructions, conditional, 214-
220, 550-553

move quad word instruction, 183

move sign-extended byte to double 
word instruction, 185 


move sign-extended byte to quad 
word instruction, 185

move sign-extended byte to word 
instruction, 185 

move sign-extended double word to
quad word instruction, 185 

move sign-extended word to double 
word instruction, 185 

move sign-extended word to quad
word instruction, 185 

move single precision instruction, 296

move when equal instruction, 357 

move with zero extension instruction, 
184, 184

move word instruction, 183

move zero-extended byte to double
word instruction, 184 

move zero-extended byte to quad 
word instruction, 184 

move zero-extended byte to word 
instruction, 184 

move zero-extended word to double 
word instruction, 184

move zero-extended word to quad 
word instruction, 184 

movement operations, floating-point 
code, 296-301

movl [x86-64] move double word, 183

movq [x86-64] move quad word, 183 

MOVS [instruction class] move and 
sign-extend, 184, 185 

movsbl [x86-64] move sign-extended
byte to double word, 185 

movsbq [x86-64] move sign-extended 
byte to quad word, 185 

movsbw [x86-64] move sign-extended 
byte to word, 185 

movslq [x86-64] move sign-extended 
double word to quad word, 185 

movswl [x86-64] move sign-extended
word to double word, 185

movswq [x86-64] move sign-extended
word to quad word, 185

movw [x86-64] move word, 183 

MOVZ [instruction class] move with
zero extension, 184, 184

movzbl [x86-64] move zero-extended
byte to double word, 184 

movzbq [x86-64] move zero-extended 
byte to quad word, 184 

movzbw [x86-64] move zero-extended 
byte to word, 184

movzwl [x86-64] move zero-extended
word to double word, 184

movzwq [x86-64] move zero-extended
word to quad word, 184


mrmovq instruction, 404
mulq [x86-64] unsigned multiply, 198,
198
multi-core processors, 16, 24-25, 168,
605, 972


multi-level page tables, 819-821
multi-threading, 17-18, 25 
Multics, 16 
multicycle instructions, 468-469
multidimensional arrays, 258-260
multiple accumulators in parallelism,
536-541
multiple zone recording, 592
multiplexing, I/O, 973
concurrent programming with,
978-985
event-driven servers based on,
980-985
pros and cons, 985
multiplexors, 374, 374-375 
HCL with case expression, 378 
word-level, 378-380 
multiplication 
constants, 101-103 
floating point, 124, 302
instructions, 198 
matrices, 643-647 
two’s complement, 97-101 
unsigned, 96-97, 198, 198 
multiply instruction, 192 
multiported random access memory, 
382 
multiprocessor systems, 24 
multipurpose internet mail extensions 
(MIME) types, 949
multitasking, 733 
multiway branch statements, 232-238
munmap [Unix] unmap disk object, 839
mutexes
lock ordering rule, 1029
Pthreads, 1010
for semaphores, 1003
mutual exclusion
progress graphs, 1000
semaphores for, 1002-1004 
mutually exclusive access, 1000 


\n (newline character), 3, 897

n-gram statistics, 565

named pipes, 892

names
domain, 925, 927-929
mangling and demangling processes 

(C++ and Java), 680, 680 

protocols, 922 
types, 47 


Y86-64 pipelines, 427 
NaN (not a number) 
constants, 124 
floating point, 306 
representation, 114, 115
nanoseconds (ns), 502
National Science Foundation (NSF),
931
need_regids signal, 405
need_valC signal, 405
NEG [instruction class] negate, 192
negate instruction, 192
negation, two’s complement, 95
negative overflow, 90, 90-91
nested arrays, 258-260 
nested structures, 268 
network adapters, 597
network byte order, 925
network clients, 27, 978
Network File System (NFS), 610
network programming, 918
client-server model, 918-919
Internet. See Internet
networks, 919-923
sockets interface. See sockets
interface
summary, 964-965
Tiny Web server, 956-964
Web servers, 948-956
network servers, 27, 978 
networks, 20-21 
acyclic, 374 
LANs, 920, 920-922
WANs, 921, 921-922 
never taken (NT) branch prediction 
strategy, 428 
newline character (\n), 3, 897
next-fit block placement policy, 849,
849
nexti command, 280
NFS (Network File System), 610
NM tool, 713
no-execute (NX) memory protection,
289
no operation nop instruction, 286, 404
instruction code for, 405
pipelining, 430-431
in stack randomization, 286 
no-write-allocate approach, 630 
nodes, root, 866 
nondeterminism, 748 
nondeterministic behavior, 748 
nonexistent variables, referencing, 874
nonlocal jumps, 723, 781, 781-786
nonuniform partitioning, 416-418 
nonvolatile memory, 586
nop [x86-64] no operation instruction,
286, 404
instruction code for, 405
pipelining, 430-431
in stack randomization, 286
nop sleds, 286
norace.c [CS:APP] Pthreads
program without a race, 1027
normal operation status code, 364, 404
normalized values, floating-point, 113,
113-114 
northbridge chipsets, 588 
not a number (NaN) 
constants, 124 
floating point, 306 
representation, 114, 115
NOT [instruction class] complement,
192
NOT operation
Boolean, 51-52
C operators, 56-57
logic gates, 373
ns (nanoseconds), 502
NSF (National Science Foundation), 
931 
NSFNET, 931 
NSLOOKUP program, 928
ntohl [Unix] convert network-to-host
long, 925
ntohs [Unix] convert network-to-host
short, 925
number systems, conversions. See
conversions 
numeric limit declarations, 77 
numeric ranges 


C standards, 61
integral types, 60-62
Java standard, 68
NX (no-execute) memory protection, 
289 


.o files, 173, 672
-O1 optimization flag, 170
-O2 optimization flag, 170
OBJDUMP GNU machine-code file
reader, 173, 279, 692, 713
object code, 170, 173 
object files, 173 
executable. See executable object 
files 
formats, 673
forms, 673
relocatable, 5, 672, 673-675
shared, 673 
tools, 713 
object modules, 673
objects
C++ and Java, 266
memory-mapped, 833-836
private, 834, 834
program, 34
shared, 699, 833-836
as struct, 266-267
oct word, 197, 197-198
OF [x86-64] overflow flag condition
code, 201, 355
off-by-one errors, 872 
offsets
GOTs, 705, 705-707
memory references, 181
PPOs, 814
unions, 270
VPOs, 814
-Og optimization flag, 170, 563
one-operand multiply instructions,
198
ones’-complement representation, 68
open [Unix] open file, 897, 893-895
open_clientfd [CS:APP] establish 
connection with server, 942, 
942-944 
open_listenfd [CS:APP] establish a 
listening socket, 944, 944 
open operations for files, 891, 893-895 
open shared library function, 707 
open source operating systems, 86-87
opendir functions, 905
operand specifiers, 180-182
operate instruction, 10
operating systems (OS), 15
files, 19
hardware management, 14-15
kernels, 19
Linux, 20, 45
processes, 15-17
threads, 17-18 
Unix, 35 
virtual memory, 18-19 
Windows, 45 
operations 
bit-level, 54-56 oC 
logical, 56-57 
shift, 57-59 
optest script, 465 
optimization 
address translation, 830 
compiler, 170 
levels, 498 
program performance. See 
performance 
optimization blockers, 496-497, 500
OPTIONS method, 951 


OR [instruction class] or, 192 
OR operation 
Boolean, 51-52 
C operators, 56-57 
HCL expressions, 374-375
logic gates, 373
order, bytes, 42-49
disassembled code, 210 
network, 925 
unions, 272 
origin servers, 952 
OS. See operating systems (OS) 
Ossanna, Joe, 16 
out-of-bounds memory references. 
See buffer overflow 
out-of-order execution, 518
five-stage pipelines, 471 
history, 522 
overflow 
arithmetic, 87, 87-89, 134 
buffer. See buffer overflow
floating-point values, 127
identifying, 92-93 
infinity representation, 115 
multiplication, 102 
negative, 90, 90-91 
operations, 32 
positive, 90, 90-91 
overflow flag condition code, 201, 355
overloaded functions (C++ and Java),
680 




P semaphore operation, 1001, 1001- 
1002 
P [CS:APP] wrapper function for 
Posix sem_wait, 1002
P6 microarchitecture, 167
PA (physical addresses), 803
vs. virtual, 803-804
Y86-64, 356
packages, processor, 825 
packet headers, 922 
packets, 922
padding
alignment, 274-275
blocks, 847
page faults
DRAM caches, 808, 808-809
Linux/x86-64 systems, 729, 832-
833
memory caches, 470
pipelining caches, 808
page frames, 805
page hits in caches, 808
page table base registers (PTBRs),
814








page table entries (PTEs), 807, 
807—808 
Core i7, 826-828
TLBs for, 817-821, 823
page table entry addresses (PTEAs),
817
page tables, 736, 823 
caches, 806-808, 807 
multi-level, 819-821 
paged-in pages, 809 
paged-out pages, 809
pages
allocation, 810
demand zero, 833
dirty, 827
physical, 805, 805-806
SSDs, 601
virtual, 289, 805, 805-806
paging
demand, 810 x 
description, 809 
parallel execution, 734 
parallel flows, 734, 734 
parallel programs, 1013
parallelism, 24, 536
instruction-level, 26, 497, 518, 562 
multiple accumulators, 536-541 
reassociation transformations, 
541-546 
SIMD, 26, 546-547
thread-level, 26 
threads for, 1013-1018 
parent directories, 892 
parent processes, 739, 739-740
parity flag condition code, 178, 306
parse_uri [CS:APP] Tiny helper
function, 960 
parseline [CS:APP] shell helper 
routine, 756 
partitioning 
addresses, 615-616 
nonuniform in pipelining, 416-418
passing data 
machine-language procedures, 239 
pointers to structures, 266 
pathnames, 893 
Patterson, David, 361, 471 
pause [Unix] suspend until signal 
arrives, 750
payloads 
aggregate, 845 
Ethernet, 920
protocol, 922
PC. See program counters (PCs)
PC-relative addressing
jumps, 207, 207-209





symbol references, 690, 692-693 
Y86-64, 359 
PC selection stage in PIPE processor, 
447-449 
PC update stage 
instruction processing, 385, 387-395 
sequential processing, 400 
sequential Y86-64 implementation,
411
PCI (peripheral component
interconnect), 598
PCIe (PCI express), 598
PE (Portable Executable) format, 673
peak utilization metric, 844-845, 845
peer threads, 986
pending bit vectors, 759 
pending signals, 758 
Pentium II microprocessor, 167 
Pentium III microprocessor, 167-168 
Pentium 4 microprocessor, 168 
Pentium 4E microprocessor, 168
Pentium microprocessor, 167 
PentiumPro microprocessor, 167, 522 
performance, 6 
Amdahl’s law, 22-24
basic strategies, 561-562
bottlenecks, 562-568
branch prediction and misprediction
penalties, 549-553
caches, 553, 631-633, 639-647
compiler capabilities and 
limitations, 498-502 
expressing, 502-504 
limiting factors, 548-553 
loop inefficiencies, 508-512 
loop unrolling, 534, 531-535
memory, 553-561
memory references, 514-517
modern processors, 518-531 
overview, 496-498 
parallelism. See parallelism 
procedure calls, 512-513 
program example, 504—508 
program profiling, 562-564
register spilling, 548-549
results summary, 547-548
sequential Y86-64 implementation,
412
summary, 568-569
Y86-64 pipelining, 464—468 
periods (.) in dotted-decimal notation, 
926 
persistent connections in HTTP, 952 
PF [x86-64] parity flag condition code, 
178, 306 
physical address spaces, 804
physical addresses (PA), 803
vs. virtual, 803-804
Y86-64, 356
physical page numbers (PPNs), 814
physical page offset (PPO), 874
physical pages (PPs), 805, 805-806
pi in floating-point representation, 140
PIC (position-independent code), 704 
data references, 704—705 
function calls, 705—707 
picoseconds (ps), 413, 502 
PIDs (process IDs), 739 
pins, DRAM, 582—583 
PIPE- processor, 421, 422, 426-430
PIPE processor stages, 439-440, 447 
decode and write-back, 449-453 
execute, 453-454 
memory, 454-455 
PC selection and fetch, 447-449 
pipelining, 26, 215, 412 
bubble, 434 
combinational, 412-414 
deep, 418-419 
diagram, 473 
five-stage, 471 
functional units, 523-524 
instruction, 549 
limitations, 416-418 
nonuniform partitioning, 416—418 
operation, 414-416 
registers, 413, 427 
store operation, 555-556 
systems with feedback, 419-421 
Y86-64. See Y86-64 pipelined 
implementations 
pipes, 977 
Pisano, Leonardo (Fibonacci), 32 
placement 
memory blocks, 847, 849 
policies, 612, 849 
platters, disk, 590, 591 
PLT (procedure linkage table), 706, 
706—707 
PMAP tool, 786 
point-to-point connections, 929 
pointers, 34
arithmetic, 257—258, 873 
arrays relationship to, 48, 277 
block, 856 
creating, 48, 188 
declaring, 41 
dereferencing, 48, 188, 257, 277, 
870-871 
examples, 188 
to functions, 278 
machine-level data, 177 
principles, 278
role, 36 
stack, 239 
to structures, 266 
virtual memory, 870-873 
void*, 48
polynomial evaluation, 530, 530, 
572-573 
pools of peer threads, 987 
pop instructions in x86-64 models, 372 
pop operations on stack, 189, 189-191 
popq [Y86-64] pop instruction, 190
behavior of, 371 
code for, 404 
run-time stack, 239 
portability and data type size, 41 
Portable Executable (PE) format, 673 
portable signal handling, 774-775 
ports 
Ethernet, 920 
Internet, 930
I/O, 598
register files, 382
.pos [Y86-64] directive, 366 
position-independent code (PIC), 704 
data references, 704—705 
function calls, 705-707 
positive overflow, 90, 90-91 
posix_error [CS:APP] reports
Posix-style errors, 1043
Posix standards, 16
Posix-style error handling, 1042, 1043
Posix threads, 987, 987—988 
POST method, 951-953 
PowerPC 
processor family, 352, 361 
RISC design, 361-363 
powers of 2, division by, 103-107 
PPNs (physical page numbers), 814 
PPO (physical page offset), 874 
PPs (physical pages), 805, 805-806 
precedence of shift operations, 59 
precision, floating-point, 113, 137 
prediction 
branch, 215 
misprediction penalties, 549-553
Y86-64 pipelining, 422, 427-429, 
preempted processes, 733 
prefetching mechanism, 641-642 
prefix sums, 502, 503, 561, 573
prepare stack for return instruction, 
292 
preprocessors, 5, 170 
prethreading, 1005-1013, 1008
primary inputs in logic gates, 374 
principle of locality, 604, 604
print command, 280 
print getaddrinfo error message 
function, 938 
printf [C Stdlib] formatted printing
function 
formatted printing, 47 
numeric values with, 75 
printing, formatted, 47
priorities 
PIPE processor forwarding sources, 
451-452 
write ports, 408 
private address space, 734
private areas, 834 
private copy-on-write structures, 836 
private declarations (C++ and Java),
677 
private objects, 834, 834 
privileged instructions, 735 
/proc filesystem, 735, 735-736, 786
procedure linkage table (PLT), 706, 
706—707 
procedure return instruction, 357 
procedures, 238-239 
call performance, 512-513 
control transfer, 241—245 
data transfer, 245-248 
floating-point code in, 301—302 
recursive, 253—255 
register usage conventions, 251—253 
run-time stack, 239-241 
process contexts, 16, 736
process graphs, 741, 742
process groups, 759
process IDs, 739 
process tables, 736 
processes, 15, 732, 738 
background, 753 
child, 740 
concurrent flow, 732-734, 733
concurrent programming with,
973-977
concurrent servers based on,
974-975
context switches, 736-737
creating and terminating, 739-743
default behavior, 744
error conditions, 745-746
exit status, 745
foreground, 753
group, 759
IDs, 739-740
loading programs, 699, 750-752
overview, 15-17
parent, 739, 740
preempted, 733
private address spaces, 734
vs. programs, 753
pros and cons, 975
reaping, 743, 743-749
running programs, 750-756
sleeping, 749-750
tools, 786-787
user and kernel modes, 734-735
waitpid function, 746-749
zombie, 743
processor-memory gap, 13, 604 
processor packages, 825
processor states, 723
processors. See central processing
units (CPUs)
producer-consumer problem, 1004,
1005-1006
profilers code, 497 
profiling, program, 562-564 
program counters (PCs), 9, 44 
in fetch stage, 384 
hazards, 435 
machine-language procedures, 239
%rip, 171
SEQ timing, 401 
Y86-64 instruction set architecture, 
356 
Y86-64 pipelining, 423, 427-429
program data references locality, 
606-607 
program header tables, 696, 696
program registers 
clocked, 381-384 
data hazards, 435 
Y86-64, 355-356
programmable ROMs (PROMs), 587
programmer-visible state, 355, 355-356
programs 
code and data, 76 
concurrent. See concurrent 
programming
forms, 4-5
loading and running, 750-752
machine-level. See machine-level 
programming 
objects, 34 
vs. processes, 753
profiling, 562-564 
running, 10-12, 753-756 
Y86-64, 364-370 
progress graphs, 999, 999-1001 
deadlock regions, 1027-1028, 1028
forbidden regions, 1003
limitations, 1004 
prologue blocks, 855
PROMs (programmable ROMs), 587
protection, memory, 812-813 
protocol software, 922
protocols, 922 
proxy caches, 952
proxy chains, 952
ps (picoseconds), 413, 502
PS tool, 786
pseudorandom number generator
functions, 1021
psum-array.c [CS:APP] parallel sum
program using array, 1016
psum-local.c [CS:APP] parallel sum
program using local variables,
1017
psum-mutex.c [CS:APP] parallel sum
program using mutex, 1015
PTBRs (page table base registers),
814
PTEAs (page table entry addresses),
817
PTEs (page table entries), 807, 
807-808 
Core i7, 826-828 
TLBs for, 817-821, 823
pthread_cancel [Unix] terminate
another thread, 989
pthread_create [Unix] create a
thread, 988
pthread_detach [Unix] detach
thread, 990, 990
pthread_exit [Unix] terminate
current thread, 989
pthread_join [Unix] reap a thread,
989
pthread_once [Unix] initialize a
thread, 990, 1012
pthread_self [Unix] get thread ID,
988
Pthreads, 987, 987-988, 1010
public declarations (C++ and Java),
677
push instructions in x86-64 models,
372
push operations on stack, 189, 189-191
pushq [x86-64] push quad word, 173,
190, 190, 357
code for, 404
processing steps, 370-371, 392
run-time stack, 239
PUT method in HTTP, 951


“put to” operator (C++), 890

qsort function, 566
quad words, 177
QuickPath interconnect, 588, 826
quit command, 280
R_X86_64_32 (absolute addressing), 
691 
R_X86_64_PC32 (PC-relative 
addressing), 690 
symbol table entry, 677 
and Unix, 673 
%r8 [Y86-64] program register, 180,
355
%r8d [x86-64] low order 32 bits of
register %r8, 180
%r8w [x86-64] low order 16 bits of
register %r8, 180
%r9 [Y86-64] program register, 180,
355
%r9d [x86-64] low order 32 bits of
register %r9, 180
%r9w [x86-64] low order 16 bits of
register %r9, 180
%r10 [Y86-64] program register, 180,
355
%r10d [x86-64] low order 32 bits of
register %r10, 180
%r10w [x86-64] low order 16 bits of
register %r10, 180
%r11 [Y86-64] program register, 180,
355
%r11d [x86-64] low order 32 bits of
register %r11, 180
%r11w [x86-64] low order 16 bits of
register %r11, 180
%r12 [Y86-64] program register, 180,
355
%r12d [x86-64] low order 32 bits of
register %r12, 180
%r12w [x86-64] low order 16 bits of
register %r12, 180
%r13 [Y86-64] program register, 180,
355
%r13d [x86-64] low order 32 bits of
register %r13, 180
%r13w [x86-64] low order 16 bits of
register %r13, 180
%r14 [Y86-64] program register, 180,
355
%r14d [x86-64] low order 32 bits of
register %r14, 180
%r14w [x86-64] low order 16 bits of
register %r14, 180
%r15 [x86-64] program register, 180,
355
%r15d [x86-64] low order 32 bits of
register %r15, 180
%r15w [x86-64] low order 16 bits of
register %r15, 180
race.c [CS:APP] program with a
race, 1025
race conditions, 776, 992
concurrent programming, 1025,
1025-1027 
signals, 776—778 
RAM. See random access memory 
(RAM) 
rand [CS:APP] pseudorandom 
number generator, 1021, 1024 
rand_r function, 1024
random access memory (RAM), 381,
581
dynamic. See dynamic RAM
(DRAM)
multiported, 382 
processors, 384 
SEQ timing, 401
static. See static RAM (SRAM) 
random operations in SSDs, 600 
random replacement policies, 612
ranges
asymmetric, 66, 77
bytes, 36 
constants for, 67-68 
data types, 40 
integral types, 60-62 
Java standard, 68 
RAS (row access strobe) requests, 583 
%rax [Y86-64] program register, 180,
355
%rbp [Y86-64] program register, 180,
355
%rbx [Y86-64] program register, 180,
%rcx [Y86-64] program register, 180,
%rdi [Y86-64] program register, 180,
%rdx [Y86-64] program register, 180,
reachability graphs, 866
reachable nodes, 866
read access, 289
read and echo input lines function,
947

read bandwidth, 639 

read environment variable function, 
751 

read/evaluate steps, 753 

read [Unix] read file, 895, 895-897 
read-only memory (ROM), 586
read-only register, 527 
read operations 
buffered, 898, 900-901 
disk sectors, 597-599 
file metadata, 903-904
files, 897, 895-897
SSDs, 601 
unbuffered, 897-898 
uninitialized memory, 871 
read ports, 382 
read_requesthdrs [CS:APP] Tiny
helper function, 960 
read sets, 978 
read throughput, 639
read transactions 
descriptions, 587 
example of, 588-589 
read/write heads, 592 
readdir functions, 905
READELF GNU object file reader, 678,
713 
readers-writers problem, 1006, 1008 
reading 
directory contents, 905-906
disk sectors, 597
readline function, 903
readn function, 903
ready read descriptors, 978
ready sets, 978
realloc function, 841 
reap thread function, 989 
reaping 
child processes, 743, 743-749 
threads, 989 
rearranging signals in pipelining, 
426-427 
reassociation transformations, 541,
541-546, 570 
receiving signals, 758, 762—764 
recording density, 597 
recording zones, 592 
recursive procedures, 253-255 
redirection of I/O, 909, 909-910 
reduced instruction set computers
(RISC), 361
vs. CISC, 361-363
SPARC processors, 471 
reentrancy issues, 1023-1024 
reentrant functions, 766, 1023
reference bits, 827
reference counts, 906
reference machines, 507
referencing 
data in free heap blocks, 874-875 
nonexistent variables, 874 
refresh, DRAM, 582 
regions, deadlock, 1027-1028, 1028
register files, 70, 358 
contents, 382—383, 521 
purpose, 358-359 
SEQ timing, 401 
register identifier (ID), 358-359 
register operands, 181 
register specifier bytes in Y86-64 
instruction, 358 
register to memory move instruction, 
356 
register to register move instruction, 
356 
registers, 9 
clocked, 387 
data hazards, 435 
data transfer, 245-248
hardware, 381-384
local, 527 
local storage, 251—253 
loop, 527 
pipeline, 413, 427 
program, 355-356, 381-384, 435
read-only, 527 
register files, 171 
renaming, 522 
spilling, 548—549 
updating conventions, 179 
write-only, 527 
x86-64 integer, 179, 179-180 
Y86-64, 359, 422-426 
regular files, 833, 891 
.rel.data section, 675 
.rel.text section, 675 
relabeling signals, 426-427
relative pathnames, 893
relative speedup in parallel programs, 
1019 
reliable connections, 930
relocatable object files, 5, 672, 673- 
675 
relocation, 673, 689-690
algorithm, 691 
entries, 690, 690-691 
PC-relative references, 692-693 
practice problems, 694-695
remove item from bounded buffer
function, 1007
renaming registers, 522 
rep [x86-64] string repeat instruction 
used as no-op, 208 
replacement policies, 613
replacing blocks, 672 


report shared library error function, 
702 
reporting errors, 1043
request headers in HTTP, 951
request lines in HTTP, 957 
requests 
client-server model, 918
HTTP, 951, 951-952 
requests for comments (RFCs), 965 
reset configuration in pipelining, 460 
resident sets, 870
resources
client-server model, 918
shared, 1004—1008 
RESP [Y86-64] register ID for %rsp, 
404 
response bodies in HTTP, 952 
response headers in HTTP, 952 
response lines in HTTP, 952 
responses 
client-server model, 918 
HTTP, 952, 952-953 
restart.c [CS:APP] nonlocal jump
example, 785
restrictions, alignment, 273-276
ret [Y86-64] procedure return, 357 
ret [x86-64] return from procedure 
call, 208, 241-242 
ret instruction, 404 
processing steps, 395 
Y86-64 pipelining, 428-429, 455-457,
461-462
retiming circuits, 427
retirement units, 521
retq [x86-64] return from procedure,
241
return addresses, 241 
predicting, 429 
procedures, 240 
return penalty in CPI, 467 
reverse engineering 
loops, 222 
machine code, 765 
revolutions per minute (RPM), 590
RFCs (requests for comments), 965
ridges in memory mountains, 641
right hoinkies (>), 910
right shift operations, 57-58, 192
rings, Boolean, 52
RIO [CS:APP] Robust I/O package, 
897 
buffered functions, 898-902 
origins, 903 
unbuffered functions, 897-898
rio_read [CS:APP] internal read
function, 901
rio_readinitb [CS:APP] init read
buffer, 898, 900 
rio_readlineb [CS:APP] robust 
buffered read, 898, 902 
rio_readn [CS:APP] robust 
unbuffered read, 897, 897-899, 
901, 903 
rio_readnb [CS:APP] robust 
buffered read, 898, 902 
rio_t [CS:APP] read buffer, 900 
rio_writen [CS:APP] robust
unbuffered write, 897, 897-899, 
903 
rip [x86-64] program counter, 171 
%rip program counter, 171
RISC (reduced instruction set 
computers), 361 
vs. CISC, 361-363 
SPARC processors, 471 
Ritchie, Dennis, 2, 4, 16, 35, 914
rmdir command, 892
rmmovq [Y86-64] register to memory
move, 356, 390, 404
RNONE [Y86-64] ID for indicating no
register, 404 
Roberts, Lawrence, 931 
robust buffered read functions, 898,
902
Robust I/O (RIO) package, 897
buffered functions, 898-902 
origins, 903 
unbuffered functions, 897-898 
robust unbuffered read function, 897, 
897-899 
robust unbuffered write function, 897, 
897-899 
.rodata section, 674 
ROM (read-only memory), 586 
root directory, 892 
root nodes, 866 
rotating disks term, 597 
rotational latency of disks, 594 
rotational rate of disks, 590 
round-down mode, 121, 121
round-to-even mode, 120, 120-121,
124
round-to-nearest mode, 120, 120
round-toward-zero mode, 120, 120-121
round-up mode, 121, 121
rounding
in division, 105-106
floating-point representation,
120-122
rounding modes, 120, 120-122
routers, Ethernet, 921
routines, thread, 987
row access strobe (RAS) requests, 583 
row-major array order, 258, 606 
row-major sum function, 635, 635 
RPM (revolutions per minute), 590
rrmovq [Y86-64] register to register
move, 356, 404
%rsi [x86-64] program register, 180
%rsp [Y86-64] stack pointer program
register, 179-180, 355
run command, 280
run concurrently, 733
run time
interpositioning, 710-712
linking, 670
shared libraries, 699
stacks, 171, 239-241
running
in parallel, 734
processes, 739
programs, 10-12, 750-756


.s assembly language files, 672
SA [CS:APP] shorthand for struct
sockaddr, 933
SADR [Y86-64] status code for address
exception, 404
safe optimization, 498, 498-499
safe signal handling, 766-770
safe trajectories in progress graphs,
1000
safely emit error message and
terminate instruction, 766,
768
safely emit long int instruction, 766,
768
safely emit string instruction, 766, 768
SAL [instruction class] shift left, 192
salb [x86-64] shift left, 195
salq [x86-64] shift left, 195
salw [x86-64] shift left, 195
Sandy Bridge microprocessor, 168
SAOK [Y86-64] status code for normal
operation, 404
SAR [instruction class] shift arithmetic
right, 192, 195
SATA interfaces, 597
saturating arithmetic, 134
sbrk [C Stdlib] extend the heap, 841,
841
emulator, 855
heap memory, 850

SBUF [CS:APP] shared bounded
buffer package, 1005, 1006
sbuf_deinit [CS:APP] free bounded
buffer, 1007
sbuf_init [CS:APP] allocate and init
bounded buffer, 1007
sbuf_insert [CS:APP] insert item in
a bounded buffer, 1007
sbuf_remove [CS:APP] remove item
from bounded buffer, 1007
sbuf_t [CS:APP] bounded buffer
used by SBUF package, 1006
scalar code performance summary, 
547-548 
scalar format data, 294 
scalar instructions, 296
scale factor in memory references, 181
scaling parallel programs, 1019,
1019-1020
scanf function, 870-871 
schedule alarm to self function, 762 
schedulers, 736
scheduling, 736
events, 763
shared resources, 1004-1008
SCSI interfaces, 597
SDRAM (synchronous DRAM), 586 
second-level domain names, 928 
second readers-writers problem, 1008
sectors, disk, 590, 590-592 
access time, 593-595 
gaps, 596 
reading, 597-599 
security monoculture, 285 
security vulnerabilities, 7
getpeername function, 86-87
XDR library, 100
seeds for pseudorandom number
generators, 1021 
seek operations, 593, 891 
seek time for disks, 593, 593 
segmentation faults, 729 
segmented addressing, 287—288 
segments 
code, 696, 697-698
data, 696 
Ethernet, 920, 920 
loops, 526-527 
virtual memory, 830
segregated fits, 863, 864-865
segregated free lists, 863-865
segregated storage, 863
select [Unix] wait for I/O events,
977
self-loops, 980
self-modifying code, 435
sem_init [Unix] initialize semaphore,
1002
sem_post [Unix] V operation, 1002
sem_wait [Unix] P operation, 1002
semaphores, 1001, 1001-1002
concurrent server example, 1005-1013
for mutual exclusion, 1002-1004
for scheduling shared resources,
1004-1008
sending signals, 735, 759-762
separate compilation, 670 
SEQ+ pipelined implementations,
421, 421-422
SEQ Y86-64 processor design.
See sequential Y86-64
implementation
sequential circuits, 387
sequential execution, 200-201 
sequential operations in SSDs, 600 
sequential reference patterns, 606 
sequential Y86-64 implementation, 
384, 421 
decode and write-back stage,
406-408
execute stage, 408-409
fetch stage, 404-406
hardware structure, 396-400
instruction processing stages,
384-395
memory stage, 409-411
PC update stage, 411
performance, 412
SEQ+ implementations, 421,
421-422
timing, 400-403
serve_dynamic [CS:APP] Tiny
helper function, 963-964
serve_static [CS:APP] Tiny helper
function, 961—963 
servers, 27 
client-server model, 918
concurrent. See concurrent servers
network, 21
Web. See Web servers
service conversions in sockets
interface, 937-942
services in client-server model, 918
serving 
dynamic content, 953-954 
Web content, 949 
set associative caches, 624 
line matching and word selection, 
625-626 
line replacement, 625 
set selection, 625, 625 
set bit in descriptor set macro, 978
set index bits, 615, 615-616 
set on equal instruction, 203 
set on greater instruction, 203 
set on greater or equal instruction, 203
set on less instruction, 203 
set on less or equal instruction, 203 
set on negative instruction, 203 
set on nonnegative instruction, 203 
set on not equal instruction, 203 
set on not greater instruction, 203
set on not greater or equal instruction, 
203 
set on not less instruction, 203 
set on not less or equal instruction, 
203 
set on not zero instruction, 203 
set on unsigned greater instruction, 
203 
set on unsigned greater or equal 
instruction, 203 
set on unsigned less instruction, 203
set on unsigned less or equal 
instruction, 203 
set on unsigned not greater instruction,
203 
set on unsigned not less instruction, 
203 
set on unsigned not less or equal 
instruction, 203 
set on zero instruction, 203 
set process group ID function, 759 
set selection 
direct-mapped caches, 618 
fully associative caches, 625
set associative caches, 625 
seta [x86-64] set on unsigned greater,
203
setae [x86-64] set on unsigned greater
or equal, 203 
setb [x86-64] set on unsigned less, 203 
setbe [x86-64] set on unsigned less or 
equal, 203 
sete [x86-64] set on equal, 203 
setenv [Unix] create/change
environment variable, 752
setg [x86-64] set on greater, 203
setge [x86-64] set on greater or equal, 
203 
setjmp [C Stdlib] init nonlocal jump,
723, 781, 783
setjmp.c [CS:APP] nonlocal jump
example, 784
setl [x86-64] set on less, 203
setle [x86-64] set on less or equal, 
203 
setna [x86-64] set on unsigned not 
greater, 203 
setnae [x86-64] set on unsigned not 
less or equal, 203 


setnb [x86-64] set on unsigned not
less, 203
setnbe [x86-64] set on unsigned not
less or equal, 203
setne [x86-64] set on not equal, 203
setng [x86-64] set on not greater, 203
setnge [x86-64] set on not greater or
equal, 203
setnl [x86-64] set on not less, 203
setnle [x86-64] set on not less or
equal, 203
setns [x86-64] set on nonnegative,
203
setnz [x86-64] set on not zero, 203
setpgid [Unix] set process group ID,
759
sets
vs. cache lines, 634
membership, 380-381
sets [x86-64] set on negative, 203
setz [x86-64] set on zero, 203
SF [x86-64] sign flag condition code,
201, 355
sh [Unix] Unix shell program, 753 
Shannon, Claude, 51 
shared areas, 834
shared libraries, 79, 699 
dynamic linking with, 699-701 
loading and linking from 
applications, 701-703
shared object files, 673 
shared objects, 699, 833-836, 834 
shared resources, scheduling, 1004-1008
shared variables, 992-995, 993 
sharing 
files, 906-908 
virtual memory for, 812 
sharing.c [CS:APP] sharing in 
Pthreads programs, 993 
shellex.c [CS:APP] shell main 
routine, 754 
shells, 7, 753 
shift arithmetic right instruction, 192 
shift left instruction, 192 
shift logical right instruction, 192 
shift operations, 57, 57-59
for division, 103-107 
machine language, 194-196 
for multiplication, 101—103 
shift arithmetic right instruction,
192
shift left instruction, 192 
shift logical right instruction, 192 
SHL [instruction class] shift left, 192, 
195 





SHLT [Y86-64] status code for halt, 
404 
short counts, 895 
short [C] integer data type, 40, 61 
SHR [instruction class] shift logical 
right, 192, 195 
%si [x86-64] low order 16 bits of 
register %rsi, 180
side effects, 500
sig_atomic_t type, 770
sigaction [Unix] install portable 
handler, 775 
sigaddset [Unix] add signal to signal 
set, 765 
sigdelset [Unix] delete signal from 
signal set, 765 
sigemptyset [Unix] clear a signal set, 
765 
sigfillset [Unix] add every signal 
to signal set, 765 
sigint.c [CS:APP] catches SIGINT 
signal, 763 
sigismember [Unix] test signal set 
membership, 765 
siglongjmp [Unix] init nonlocal 
jump, 783, 785 
sign bits 
floating-point representation, 137 
two's complement representation, 
64 
sign extension, 77, 77, 183-184 
sign flag condition code, 201, 355
sign-magnitude representation, 68
Signal [CS:APP] portable version of
signal, 775 
signal handlers, 758 
installing, 763 
writing, 766-775 
Y86-64, 364 
signal1.c [CS:APP] flawed signal
handler, 771
signal2.c [CS:APP] flawed signal
handler, 772 
signals, 722, 756—758 
blocking and unblocking, 764-765 
correct handling, 770-774 
enabling and disabling, 52 
flow synchronizing, 776-778
portable handling, 774—775 
processes, 739 
receiving, 762, 762—764 
safe handling, 766-770 
sending, 758, 759-762
terminology, 758-759
waiting for, 778-781
Y86-64 pipelined implementations, 
426-427 
signed [C] integer data type, 41 
signed divide instruction, 198, 199
signed integers, 32, 40, 61-62, 67 
alternate representations, 68 
shift operations, 58 
two's complement encoding, 64—70 
unsigned conversions, 70—76 
signed multiply instruction, 198, 198
signed number representation 
guidelines, 83-84 
ones' complement, 68 
sign magnitude, 68 
signed size type, 896
significands in floating-point
representation, 112
signs for floating-point representation, 
112, 112-113 
SIGPIPE signal, 964 
sigprocmask [Unix] block and 
unblock signals, 765, 781
sigsetjmp [Unix] init nonlocal
handler jump, 781, 785 
sigsuspend [Unix] wait for a signal, 
781 
%sil [x86-64] low order 8 of register
%rsi, 180
SimAquarium game, 637-638
SIMD (single-instruction, multiple-
data) parallelism, 26, 294, 546-547
SIMD streaming extensions (SSE) 
instructions, 276 
simple segregated storage, 863, 
863-864 
simplicity in instruction processing, 
385 
simulated concurrency, 24
simultaneous multi-threading, 25 
single-bit data connections, 398 
single-instruction, multiple-data
(SIMD) parallelism, 26, 294, 
546-547 
single-precision floating-point 
representation
IEEE, 113, 113
machine-level data, 178 
support for, 41 
SINS [Y86-64] status code for illegal 
instruction exception, 404 
sio_error [CS:APP] safely emit
error message and terminate,
766, 768
sio_ltoa [CS:APP] safely emit string,
768
sio_putl [CS:APP] safely emit long
int, 766, 768
sio_puts [CS:APP] safely emit string,
766, 768
sio_strlen [CS:APP] safely emit
string, 768
size 
blocks, 848 
caches, 632—633 
data, 39—42 
word, 8, 39 
size classes, 863 
size_t [Unix] unsigned size type for 
designating sizes, 44, 83-84, 86, 
99, 896
SIZE tool, 713 
sizeof [C] compute size of object, 45, 
129-131, 133 
slashes (/) for root directory, 892 
sleep [Unix] suspend process, 749
slow system calls, 774
.so shared object file, 699
sockaddr [Unix] generic socket
address structure, 933
sockaddr_in [Unix] Internet-style
socket address structure, 933 
socket addresses, 930 
socket descriptors, 912, 954 
socket function, 934
socket pairs, 930
sockets, 892, 930
sockets interface, 932, 932-933 
accept function, 936-937 
address structures, 933-934 
bind function, 935 
connect function, 934-935 
example, 944-947 
helper functions, 942-944 
host and service conversions, 
937-942 
listen function, 935 
open_clientfd function, 934-935
socket function, 934
Software Engineering Institute, 100
software exceptions
C++ and Java, 786
ECF for, 723-724
vs. hardware, 724
Solaris Sun Microsystems operating 
system, 16, 45 
solid state disks (SSDs), 591, 600 
benefits, 587 
operation, 600-602 
sorting performance, 566-567
source files, 3 
source hosts, 922 
source programs, 3
southbridge chipsets, 588 
Soviet Union, 931 
%sp [x86-64] low order 16 bits of stack 
pointer register %rsp, 180
SPARC
five-stage pipelines, 471
RISC processors, 363 
Sun Microsystems processor, 45 
spare cylinders, 596 
spatial locality, 604 
caches, 643-647 
exploiting, 614 
special arithmetic operations, 197-200 
special control conditions in Y86-64
pipelining 
detecting, 457—459 
handling, 455—457 
specifiers, operand, 180-182 
speculative execution, 519, 519,
549-550 
speedup of parallel programs, 1018, 
1018-1019 
spilling, register, 548-549 
spin loops, 778 
spindles, disks, 590 
%spl [x86-64] low order 8 of stack 
pointer register %rsp, 180
splitting 
free blocks, 849-850 
memory blocks, 847 
sprintf [C Stdlib] function, 47, 282
Sputnik, 931 
sqrtsd [x86-64] double-precision 
square root, 302 
sqrtss [x86-64] single-precision 
square root, 302 
square root floating-point instructions, 
302 
squashing mispredicted branch 
handling, 444 
SRAM (static RAM), 13, 581, 581—582 
cache. See caches and cache memory 
vs. DRAM, 582 
trends, 602-603 
SRAM cells, 587 
srand [CS:APP] pseudorandom 
number generator seed, 1021
SSDs (solid state disks), 591, 600
benefits, 587
operation, 600-602
SSE (streaming SIMD extensions)
instructions, 167-168, 294
alignment exceptions, 276
parallelism, 546-547
ssize_t [Unix] signed size type, 896
stack corruption detection, 286-289
stack frames, 240, 240-241
alignment on, 276
variable-size, 290-293
stack pointers, 239
stack protectors, 286-287 
stack randomization, 284—286 
stack storage allocation function, 290, 
324 
stacks, 19, 189, 189-191
bottom, 190
buffer overflow, 871
with execve function, 751-752
local storage, 248-251 
machine-level programming, 171 
overflow. See buffer overflow 
recursive procedures, 253-255 
run time, 239-241 
top, 190 
Y86-64 pipelining, 429 
stages, SEQ, 384—395 
decode and write-back, 406-408
execute, 408-409
fetch, 404-406
memory stage, 409-411
PC update, 411
stalling
for data hazards, 442
pipeline, 433-436, 459-460
Stallman, Richard, 6, 16
standard C library, 4, 4-5
standard error files, 891
standard I/O library, 911, 911
standard input files, 891
standard output files, 891
Standard Unix Specification, 16 
_start, 698 
starvation in readers-writers problem,
1008
stat [Unix] fetch file metadata,
903-904
state machines, 980
states
bistable memory, 581
deadlock, 1027
processor, 723
programmer-visible, 355, 355-356
progress graphs, 999
state machines, 980
static libraries, 684, 684-688
static linkers, 672
static linking, 672
static RAM (SRAM), 13, 581, 581-582
cache. See caches and cache memory 
vs. DRAM, 582
trends, 602-603
static [C] variable and function
attribute, 676-677, 994
static variables, 994, 994-995
static Web content, 949
status code registers, 435 
status codes 
HTTP, 953 
Y86-64, 363-364, 364 
status messages in HTTP, 953 
status register hazards, 435 
STDERR_FILENO [Unix] constant for 
standard error descriptor, 891 
stderr stream, 911 
STDIN_FILENO [Unix] constant for
standard input descriptor, 891
stdin stream, 911
stdint.h file, 67
<stdio.h> [Unix] standard I/O
library header file, 84, 86
stdlib, 4, 4-5
STDOUT_FILENO [Unix] constant for
standard output descriptor, 891
stdout stream, 911
stepi command, 280
stepi 4 command, 280
Stevens, W. Richard, 903, 914, 965, 
1041 
stopped processes, 739 
storage. See also information storage 
device hierarchy, 14
registers, 251-253
stack, 248-251
storage classes for variables, 994-995
store buffers, 557-558
store instructions, 70
store operations
example, 588
processors, 521
store performance of memory, 555-561
STRACE tool, 786 
straight-line code, 200-201 
strcat [C Stdlib] string concatenation
function, 282
strcpy [C Stdlib] string copy function,
282
streaming SIMD extensions (SSE)
instructions, 167-168, 294
alignment exceptions, 276
parallelism, 546-547
streams, 977
buffers, 911
directory, 905
full duplex, 912
strerror function, 738
stride-1 reference patterns, 606
stride-k reference patterns, 606
string concatenation function, 282
string copy function, 282
string generation function, 282
strings
in buffer overflow, 279, 281
length, 83
lowercase conversions, 509-511
representing, 49
STRINGS tool, 713 
STRIP tool, 713 
strlen [C Stdlib] string length 
function, 83, 509—511 
strong scaling, 1019
strong symbols, 680
.strtab section, 675
strtok [C Stdlib] string function, 1024
struct [C] structure data type, 265
structures
address, 933-934
heterogeneous. See heterogeneous
data structures
machine-level programming, 171
SUB [instruction class] subtract, 192
subdomains, 927
subq [Y86-64] subtract, 356, 388
substitution, inline, 501
subtract instruction, 192 
subtract operation in execute stage, 
408 
subtraction, floating-point, 302 
sumarraycols [CS:APP] column-
major sum, 636
sumarrayrows [CS:APP] row-major
sum, 635, 635
sumvec [CS:APP] vector sum, 634,
635-636
Sun Microsystems, 45 
five-stage pipelines, 471 
RISC processors, 363 
security vulnerability, 100
supercells, 582, 582-583
superscalar processors, 26, 471, 518
supervisor mode, 735 
surfaces, disks, 590, 595 
suspend process function, 749 
suspend until signal arrives function, 
750 
suspended processes, 739 
swap areas, 833 
swap files, 833 
swap space, 833 
swapped-in pages, 809
swapped-out pages, 809
swapping pages, 809
sweep phase in Mark&Sweep garbage
collectors, 867
Swift, Jonathan, 43
switch [C] multiway branch
statement, 232-238
switches, context, 736-737
symbol resolution, 673, 679
duplicate symbol names, 680-684
static libraries, 684-688
symbol tables, 675, 675-679
symbolic links, 892 
symbolic methods, 466 
symbols 
address translation, 814 
caches, 617 
global, 675 
local, 676 
relocation, 689-695 
strong and weak, 680 
.symtab section, 675
synchronization
flow, 776-778
Java threads, 1010
progress graphs, 1000
threads, 995-999
progress graphs, 999-1001
with semaphores. See semaphores
synchronization errors, 995 
synchronous DRAM (SDRAM), 586 
synchronous exceptions, 727 
/sys filesystem, 736 
syscall function, 730
system bus, 587
system calls, 17, 727, 727-728
error handling, 737-738 
Linux/x86-64 systems, 730-731 
slow, 774 
system-level functions, 730 
system-level I/O
closing files, 894-895
file metadata, 903-904
I/O redirection, 909-910
opening files, 893-895
packages summary, 911-913
reading files, 895-897
RIO package, 897-903
sharing files, 906-908 
standard, 911 
summary, 913-914 
Unix I/O, 890-891 
writing files, 896-897 
system startup function, 698
System V Unix, 16
semaphores, 977
shared memory, 977

T2B (two’s complement to binary 
conversion), 60, 65, 71 
T2U (two's complement to unsigned
conversion), 60, 71, 71-73
tables
descriptor, 907, 909
exception, 725, 725
GOTs, 705, 705-707 
hash, 567-568 
header, 674, 696
jump, 233, 234-235, 725 
page, 736, 806-808, 807, 819-821, 
823 
program header, 696, 696 
symbol, 675, 675-679 
tag bits, 615, 616 
tags, boundary, 851, 851-854, 859
Tanenbaum, Andrew S., 20 
target functions in interpositioning 
libraries, 708 
targets, jump, 206, 206-209 
TCP (Transmission Control Protocol), 
924 
TCP/IP (Transmission Control 
Protocol/Internet Protocol), 
924 
tcsh [Unix] Unix shell program, 753
TELNET remote login program, 950, 
950-951 
temporal locality, 604 
blocking for, 647 
exploiting, 614
terminate another thread function,
989
terminate current thread function, 989
terminate process function, 739
terminated processes, 739
terminating
processes, 739-743
threads, 988-989
TEST [instruction class] Test, 202 
test byte instruction, 202 
test double word instruction, 202 
test instructions, 202 
test quad word instruction, 202
test signal set membership instruction,
765
test word instruction, 202 
testb [x86-64] test byte, 202 
testing Y86-64 pipeline design, 465 
testl [x86-64] test double word, 202
testq [x86-64] test quad word, 202
testw [x86-64] test word, 202 
text files, 3, 891, 892, 900 
text lines, 897, 898
text representation
ASCII, 49
Unicode, 50
.text section, 674
Thompson, Ken, 16
thrashing
direct-mapped caches, 622, 622-623
pages, 810
thread contexts, 986, 993
thread IDs (TIDs), 986 
thread-level concurrency, 24-26 
thread-level parallelism, 26 
thread routines, 987, 988 
thread-safe functions, 1020, 1020-1022
thread-unsafe functions, 1020, 1020-
1022
threads, 17, 18, 973, 985-986
concurrent server based on, 991-
992
creating, 988 
detaching, 989-990 
execution model, 986-987 
initializing, 990 
library functions for, 1024-1025 
mapping variables in, 994-995 
memory models, 993-994 
for parallelism, 1013-1018
Posix, 987-988 
races, 1025-1027 
reaping, 989 
safety issues, 1020-1022 
shared variables with, 992-995, 993
synchronizing, 995-999
progress graphs, 999-1001
with semaphores. See semaphores
terminating, 988-989
three-stage pipelines, 414-416
throughput, 524 
dynamic memory allocators, 845 
pipelining for. See pipelining
read, 639
throughput bounds, 518, 524 
TIDs (thread IDs), 986 


time slicing, 733 

timing, SEQ, 400-403 

Tiny [CS:APP] Web server, 956, 
956-964 

TLB index (TLBI), 827 

TLB tags (TLBT), 817, 823 

TLBI (TLB index), 877
TLBs (translation lookaside buffers),
470, 817, 817-825
TLBT (TLB tags), 817, 823
TMax (maximum two's complement 
number), 60, 65, 66 
TMin (minimum two's complement 
number), 60, 65, 66, 77 
top of stack, 190, 190
TOP tool, 786 
topological sorts of vertices, 742 
Torvalds, Linus, 20 
touching pages, 833 
TRACE method, 951 
tracing execution, 387, 394-395, 403 
track density of disks, 591
tracks, disk, 590, 595
trajectories in progress graphs, 1000,
1000
transactions
bus, 587, 588-589 
client-server model, 918 
client-server vs. database, 919 
HTTP, 950-953 
transfer time for disks, 594
transfer units, 672
transferring control, 241-245
transformations, reassociation, 541,
541-546, 570
transistors in Moore's Law, 169
transitions 
progress graphs, 999
state machines, 980
translating programs, 4-5
translation
address. See address translation
switch statements, 233
translation lookaside buffers (TLBs),
470, 817, 817-825
Transmission Control Protocol (TCP),
924
Transmission Control Protocol/
Internet Protocol (TCP/IP),
924
trap exception class, 727
traps, 727, 727-728

tree height reduction, 570 

tree structure, 270-271 


truncating numbers, 81-82
two-operand multiply instructions,
198
two-way parallelism, 536-537
two's-complement representation
addition, 90-95
asymmetric range, 66, 77
bit-level representation, 96


encodings, 32 

minimum value, 65 

multiplication, 97-101 

negation, 95 

signed and unsigned conversions,
70-74

signed numbers, 64, 64-70 
typedef [C] type definition, 44, 47 
types 

conversions. See conversions 

floating point, 124-126

integral, 60, 60-62 

machine-level, 171, 177-178 

MIME, 949 

naming, 47 

pointers, 36, 277 

pointers associated with, 34 


U2B (unsigned to binary conversion), 
60, 64, 71, 74 

U2T (unsigned to two's-complement 
conversion), 60, 72, 73, 82 

ucomisd [x86-64] compare double
precision, 306

ucomiss [x86-64] compare single
precision, 306

UDP (Unreliable Datagram 
Protocol), 924 

UINT_MAX constant, maximum
unsigned integer, 68

UINTN_MAX [C] maximum value of
N-bit unsigned data type, 67

uintN_t [C] N-bit unsigned integer 
data type, 67 

umask function, 894-895 

UMax (maximum unsigned number),
63, 66-67

unallocated pages, 805 

unary operations, 194 

unblocking signals, 764-765

unbuffered input and output, 897-898 

uncached pages, 806 

unconditional jump instruction, 357 

underflow, gradual, 775

Unicode characters, 50 


unified caches, 637
uniform resource identifiers (URIs),
951 


uninitialized memory, reading, 871 

unions, 44, 269-273

uniprocessor systems, 16, 24

United States, ARPA creation in, 931 

universal resource locators (URLs),
949

Universal Serial Bus (USB), 596 





Unix 4.xBSD, 16, 932
unix_error [CS:APP] reports Unix-
style errors, 738, 738, 1043
Unix IPC, 977
Unix operating systems, 16, 16, 35
constants, 746
error handling, 1043, 1043
I/O, 19, 890, 890-891
Unix signals, 759
unlocking mutexes, 1003
unmap disk object function, 839
unordered, floating-point comparison 
outcome, 306 
unpack and interleave low packed 
double precision instruction, 298 
unpack and interleave low packed 
single precision instruction, 298 
Unreliable Datagram Protocol
(UDP), 924
unrolling
k x 1, 531
k x 1a, 544
k x k, 539-540
loops, 502, 504, 531, 531-535, 572
unsafe regions in progress graphs,
1000
unsafe trajectories in progress graphs, 
1000 
unsetenv [Unix] delete environment 
variable, 752 
unsigned [C] integer data type, 41, 61 
unsigned representations, 83-84 
addition, 84-90
conversions, 70-76
division, 198, 199 
encodings, 32, 62-64 
integers, 40 
maximum value, 63 
multiplication, 96-97, 198, 198 
unsigned size type, 896 
update instructions, 9-10 
URIs (uniform resource identifiers),
951
URLs (universal resource locators), 
949 
USB (Universal Serial Bus), 596 
user-level memory mapping, 837-839 
user mode, 726 
processes, 734-736, 735
regular functions in, 728 
user stack, 79 
UTF-8 characters, 50 


V [CS:APP] wrapper function for
Posix sem_post, 1002
v-node tables, 906

V semaphore operation, 1001, 1001-
1002

VA. See virtual addresses (VA) 

vaddsd [x86-64] double-precision
addition, 302

vaddss [x86-64] single-precision 
addition, 302 
VALGRIND program, 569 
valid bit 
cache lines, 645
page tables, 807 

values, pointers, 36, 277 

vandpd [x86-64] and packed double
precision, 305

vandps [x86-64] and packed single
precision, 305

variable-size stack frames, 290-293

variable-size arrays, 262-265

variables
mapping, 994-995
nonexistent, 874
shared, 992-995, 993
storage classes, 994-995

VAX computers (Digital Equipment 
Corporation), Boolean 
operations, 56 

vcvtps2pd [x86-64] convert packed
single to packed double
precision, 298

vcvtsi2sd [x86-64] convert integer to 
double precision, 297 

vcvtsi2sdq [x86-64] convert quad- 
word integer to double precision, 
297 

vcvtsi2ss [x86-64] convert integer to 
single precision, 297 

vcvtsi2ssq [x86-64] convert quad-
word integer to single precision,
297

vcvttsd2si [x86-64] convert double
precision to integer, 297

vcvttsd2siq [x86-64] convert double
precision to quad-word integer,
297

vcvttss2si [x86-64] convert single
precision to integer, 297

vcvttss2siq [x86-64] convert single
precision to quad-word integer,
297

vdivsd [x86-64] double-precision
division, 302 

vdivss [x86-64] single-precision 
division, 302 

vector data types, 26, 504-507


vector dot product function, 622 
vector registers, 171, 546 
vector sum function, 634, 635-636 
vectors, bit, 51, 51-52
verification in pipelining, 466
Verilog hardware description language
for logic design, 373
Y86-64 pipelining implementation,
467
vertical bars || for OR operation, 373
VHDL hardware description
language, 373
victim blocks, 612
Video RAM (VRAM), 586
virtual address spaces, 18, 34, 804


virtual addresses (VA) 
machine-level programming, 170-
171
vs. physical, 803-804
Y86-64, 356


virtual machines 
as abstraction, 27 
Java byte code, 310 
virtual memory (VM), 15, 18, 34,
802
as abstraction, 27
address spaces, 804-805
address translation. See address
translation

bugs, 870-875 

for caching, 805-811

characteristics, 802-803

Core i7, 825-828

dynamic memory allocation. See 
dynamic memory allocation 

garbage collection, 865-870

Linux, 830-833 

in loading, 699 

managing, 839

mapping. See memory mapping 

for memory management, 811-812 

for memory protection, 812-813 


overview, 18-19
physical vs. virtual addresses,
803-804
summary, 875-876

virtual page numbers (VPNs), 814

virtual page offset (VPO), 814

virtual pages (VPs), 289, 805, 805-806 

viruses, 285-286 

VLOG implementation of Y86-64 
pipelining, 467 

VM. See virtual memory (VM) 

vmaxsd [x86-64] double-precision
maximum, 302
vmaxss [x86-64] single-precision
maximum, 302 

vminsd [x86-64] double-precision 
minimum, 302 

vminss [x86-64] single-precision
minimum, 302

vmovapd [x86-64] move aligned,
packed double precision, 296

vmovaps [x86-64] move aligned,
packed single precision, 296

vmovsd [x86-64] move double 
precision, 296 

vmovss [x86-64] move single precision, 
296 

vmulsd [x86-64] double-precision
multiplication, 302

vmulss [x86-64] single-precision
multiplication, 302

void* [C] untyped pointers, 48 

volatile [C] volatile type qualifier,
769-770

VP (virtual pages), 289, 805, 805-806

VPNs (virtual page numbers), 814

VPO (virtual page offset), 814

VRAM (video RAM), 586

vsubsd [x86-64] double-precision
subtraction, 302

vsubss [x86-64] single-precision
subtraction, 302

VTUNE program, 569

vulnerabilities, security, 86-87 

vunpcklpd [x86-64] unpack and
interleave low packed double
precision, 298

vunpcklps [x86-64] unpack and
interleave low packed single
precision, 298

vxorpd [x86-64] EXCLUSIVE-OR packed
double precision, 305

vxorps [x86-64] EXCLUSIVE-OR packed
single precision, 305 


wait [Unix] wait for child process, 746

wait for child process functions, 744,
746-749

wait for client connection request
function, 936, 936-937

wait for signal instruction, 781 

wait.h file, 746 

wait sets, 744, 744

waiting for signals, 778-781

waitpid [Unix] wait for child process,
743, 746-749

waitpid1 [CS:APP] waitpid
example, 747 







waitpid2 [CS:APP] waitpid 
example, 749 
WANs (wide area networks), 921,
921-922
warming up caches, 612
WCONTINUED constant, 744
weak scaling, 1019, 1020
weak symbols, 680
wear leveling logic, 601
Web clients, 948, 948
Web servers, 701, 948
basics, 948-949
dynamic content, 953-954 
HTTP transactions, 950-953 
Tiny example, 956-964 
Web content, 949-950 
well-known ports, 930 
well-known service names, 930 
while [C] loop statement, 223-228 
wide area networks (WANs), 921, 
921-922 
WIFEXITED constant, 745
WEXITSTATUS constant, 745
WIFSIGNALED constant, 745 
WIFSTOPPED constant, 745 
Windows Microsoft operating system, 
27, 45
wire names in hardware diagrams, 398
WNOHANG constant, 744-745
word-level combinational circuits, 
376-380 
word selection 
direct-mapped caches, 619
fully associative caches, 627-628
set associative caches, 625-626
word size, 8, 39
words, 8, 177 
working sets, 613, 810 
world-wide data connections in 
hardware diagrams, 398 
World Wide Web, 949 
worm programs, 284-286 
wrapper functions, 711 
error handling, 738, 1041, 1043-1045 
interpositioning libraries, 708 
write access, 289
write-allocate approach, 630 
write-back approach, 630 
write-back stage 
instruction processing, 385, 387-397
PIPE processor, 449-453
sequential processing, 400
sequential Y86-64 implementation,
406-408
write [Unix] write file, 895, 896-897
write hits, 630
write issues for caches, 630-631
write-only register, 527
write operations for files, 896, 896-
897
write ports 
priorities, 408 
register files, 382 
write/read dependencies, 557-559
write strategies for caches, 633
write-through approach, 630
write transactions, 587, 588-589
writen function, 903 
writers in readers-writers problem, 
1006, 1008 
writing 
signal handlers, 766-775 
SSD operations, 600
WSTOPSIG constant, 745
WTERMSIG constant, 745
WUNTRACED constant, 744-745


x86 Intel microprocessor line, 166 
x86-64 instruction set architecture vs. 
Y86-64, 360
x86-64 microprocessors, 168 
array access, 256 
conditional move instructions, 
214-220 
data alignment, 276 
exceptions, 729-731
Intel-compatible 64-bit
microprocessors, 45
machine language, 165-166 
registers 
data movement, 182-189 
operand specifiers, 180-182 
vs. Y86-64, 365-366 
x87 microprocessors, 167 
XDR library security vulnerability, 
100 
%xmm [x86-64] 16-byte media register. 
Subregion of YMM, 295 
%xmm0, return floating-point value 
register, 299, 301 





XMM, SSE vector registers, 294-296

XOR [instruction class] EXCLUSIVE-OR,
192 

xorq [Y86-64] EXCLUSIVE-OR, 356 


Y86-64 instruction set architecture,
353-354
details, 370-372 
exception handling, 363-364 
hazards, 435 
instruction encoding, 358-360 
instruction set, 356-358 
programmer-visible state, 355- 
356 
programs, 364-370
sequential implementation.
See sequential Y86-64
implementation
vs. x86-64, 360
Y86-64 pipelined implementations,
421
computation stages, 421-422
control logic. See control logic in
pipelining
exception handling, 444-447
hazards. See hazards in pipelining
memory system interfacing, 469-
470
multicycle instructions, 468-469
performance analysis, 464-468
predicted values, 427-429
register insertions, 422-426
signals, 426-427
stages. See PIPE processor stages
testing, 465
verification, 466
Verilog, 467
yas Y86-64 assembler, 366
yis Y86-64 instruction set simulator,
366
%ymm [x86-64] 32-byte media register,
295


YMM, AVX vector registers, 294-296 


zero extension, 77 

zero flag condition code, 201, 306, 355 

ZF [x86-64] zero flag condition code,
201, 306, 355

zombie processes, 743, 743-744, 770

zones, recording, 592